[spark] branch master updated (4144b6d -> 125cbe3)

2020-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 4144b6d  [SPARK-32764][SQL] -0.0 should be equal to 0.0
 add 125cbe3  [SPARK-32736][CORE] Avoid caching the removed decommissioned executors in TaskSchedulerImpl

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/ExecutorAllocationManager.scala   |  2 +-
 .../org/apache/spark/deploy/DeployMessage.scala    |  4 +-
 .../spark/deploy/client/StandaloneAppClient.scala  |  6 +--
 .../client/StandaloneAppClientListener.scala       |  2 +-
 .../org/apache/spark/deploy/master/Master.scala    | 11 ++--
 .../executor/CoarseGrainedExecutorBackend.scala    |  7 +--
 .../org/apache/spark/internal/config/package.scala | 10
 .../org/apache/spark/scheduler/DAGScheduler.scala  | 17 +++---
 .../spark/scheduler/ExecutorDecommissionInfo.scala | 10 ++--
 .../spark/scheduler/ExecutorLossReason.scala       | 10 ++--
 .../apache/spark/scheduler/TaskSchedulerImpl.scala | 61 +++---
 .../apache/spark/scheduler/TaskSetManager.scala    |  2 +-
 .../cluster/CoarseGrainedSchedulerBackend.scala    | 38 ++
 .../cluster/StandaloneSchedulerBackend.scala       |  7 ++-
 .../spark/deploy/client/AppClientSuite.scala       |  4 +-
 .../apache/spark/scheduler/DAGSchedulerSuite.scala | 16 --
 .../spark/scheduler/TaskSchedulerImplSuite.scala   | 36 -
 .../spark/scheduler/TaskSetManagerSuite.scala      |  9 ++--
 .../WorkerDecommissionExtendedSuite.scala          |  2 +-
 .../spark/scheduler/WorkerDecommissionSuite.scala  |  2 +-
 .../BlockManagerDecommissionIntegrationSuite.scala |  2 +-
 .../scheduler/ExecutorAllocationManager.scala      |  2 +-
 .../scheduler/ExecutorAllocationManagerSuite.scala |  2 +-
 23 files changed, 98 insertions(+), 164 deletions(-)


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-32736][CORE] Avoid caching the removed decommissioned executors in TaskSchedulerImpl

2020-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 125cbe3  [SPARK-32736][CORE] Avoid caching the removed decommissioned executors in TaskSchedulerImpl
125cbe3 is described below

commit 125cbe3ae0d664ddc80b5b83cc82a43a0cefb5ca
Author: yi.wu 
AuthorDate: Tue Sep 8 04:40:13 2020 +

[SPARK-32736][CORE] Avoid caching the removed decommissioned executors in TaskSchedulerImpl

### What changes were proposed in this pull request?

The motivation of this PR is to avoid caching the removed decommissioned executors in `TaskSchedulerImpl`. The cache was introduced in https://github.com/apache/spark/pull/29422 and holds the `isHostDecommissioned` info for a while, so that if the task's `FetchFailure` event comes after the executor loss event, `DAGScheduler` can still get `isHostDecommissioned` from the cache and unregister the host's shuffle map status when the host is decommissioned too.

This PR achieves the same goal without the cache. Instead of saving `workerLost` in `ExecutorUpdated` / `ExecutorDecommissionInfo` / `ExecutorDecommissionState`, we save `hostOpt` directly. When the host is decommissioned or lost too, `hostOpt` holds that specific host address; otherwise it is `None`, indicating that only the executor is decommissioned or lost.
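
A minimal Scala sketch of the shape of this change (field names follow the description above; the actual Spark definitions may differ in details such as extra fields):

```
// Before: a Boolean only records *whether* the host was decommissioned.
object Before {
  case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: Boolean)
}

// After: an Option[String] records *which* host is affected. None means only
// the executor is going away; Some(host) lets DAGScheduler unregister that
// host's shuffle map status too. The default value keeps call sites such as
// ExecutorDecommissionInfo("spark scale down") compiling without the flag.
object After {
  case class ExecutorDecommissionInfo(message: String, hostOpt: Option[String] = None)
}
```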

Now that we have the host info, we can also unregister the host shuffle map status when `executorLost` is triggered for the decommissioned executor.

Besides, this PR also includes a few cleanups around the touched code.

### Why are the changes needed?

It helps to unregister the shuffle map status earlier in both the decommission and the normal executor-loss cases.

It also saves memory in `TaskSchedulerImpl` and simplifies the code a little bit.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR only refactors the code. The original behaviour should be covered by `DecommissionWorkerSuite`.

Closes #29579 from Ngone51/impr-decom.

Authored-by: yi.wu 
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/ExecutorAllocationManager.scala   |  2 +-
 .../org/apache/spark/deploy/DeployMessage.scala    |  4 +-
 .../spark/deploy/client/StandaloneAppClient.scala  |  6 +--
 .../client/StandaloneAppClientListener.scala       |  2 +-
 .../org/apache/spark/deploy/master/Master.scala    | 11 ++--
 .../executor/CoarseGrainedExecutorBackend.scala    |  7 +--
 .../org/apache/spark/internal/config/package.scala | 10
 .../org/apache/spark/scheduler/DAGScheduler.scala  | 17 +++---
 .../spark/scheduler/ExecutorDecommissionInfo.scala | 10 ++--
 .../spark/scheduler/ExecutorLossReason.scala       | 10 ++--
 .../apache/spark/scheduler/TaskSchedulerImpl.scala | 61 +++---
 .../apache/spark/scheduler/TaskSetManager.scala    |  2 +-
 .../cluster/CoarseGrainedSchedulerBackend.scala    | 38 ++
 .../cluster/StandaloneSchedulerBackend.scala       |  7 ++-
 .../spark/deploy/client/AppClientSuite.scala       |  4 +-
 .../apache/spark/scheduler/DAGSchedulerSuite.scala | 16 --
 .../spark/scheduler/TaskSchedulerImplSuite.scala   | 36 -
 .../spark/scheduler/TaskSetManagerSuite.scala      |  9 ++--
 .../WorkerDecommissionExtendedSuite.scala          |  2 +-
 .../spark/scheduler/WorkerDecommissionSuite.scala  |  2 +-
 .../BlockManagerDecommissionIntegrationSuite.scala |  2 +-
 .../scheduler/ExecutorAllocationManager.scala      |  2 +-
 .../scheduler/ExecutorAllocationManagerSuite.scala |  2 +-
 23 files changed, 98 insertions(+), 164 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala b/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
index c298931..b6e14e8 100644
--- a/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
+++ b/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
@@ -580,7 +580,7 @@ private[spark] class ExecutorAllocationManager(
       // when the task backlog decreased.
       if (decommissionEnabled) {
         val executorIdsWithoutHostLoss = executorIdsToBeRemoved.toSeq.map(
-          id => (id, ExecutorDecommissionInfo("spark scale down", false))).toArray
+          id => (id, ExecutorDecommissionInfo("spark scale down"))).toArray
         client.decommissionExecutors(executorIdsWithoutHostLoss, adjustTargetNumExecutors = false)
       } else {
         client.killExecutors(executorIdsToBeRemoved.toSeq, adjustTargetNumExecutors = false,
diff --git a/core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala b/core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala
index b7a64d75..83f373d 100644
---
[spark] branch branch-3.0 updated (6b47abd -> f42b56c)

2020-09-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 6b47abd  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py
 add f42b56c  [SPARK-32764][SQL] -0.0 should be equal to 0.0

No new revisions were added by this update.

Summary of changes:
 .../expressions/codegen/CodeGenerator.scala        | 10 ++-
 .../spark/sql/catalyst/util/SQLOrderingUtil.scala  | 30 +
 .../org/apache/spark/sql/types/DoubleType.scala    |  3 +-
 .../org/apache/spark/sql/types/FloatType.scala     |  3 +-
 .../org/apache/spark/sql/types/numerics.scala      |  5 +-
 .../sql/catalyst/expressions/PredicateSuite.scala  | 16 +
 .../sql/catalyst/util/SQLOrderingUtilSuite.scala   | 75 ++
 .../org/apache/spark/sql/DataFrameSuite.scala      |  5 ++
 8 files changed, 126 insertions(+), 21 deletions(-)
 copy core/src/main/scala/org/apache/spark/SparkFiles.scala => sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala (57%)
 create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtilSuite.scala


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (c8c082c -> 4144b6d)

2020-09-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c8c082c  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py
 add 4144b6d  [SPARK-32764][SQL] -0.0 should be equal to 0.0

No new revisions were added by this update.

Summary of changes:
 .../expressions/codegen/CodeGenerator.scala        | 10 ++-
 .../spark/sql/catalyst/util/SQLOrderingUtil.scala  | 30 +
 .../org/apache/spark/sql/types/DoubleType.scala    |  3 +-
 .../org/apache/spark/sql/types/FloatType.scala     |  3 +-
 .../org/apache/spark/sql/types/numerics.scala      |  5 +-
 .../sql/catalyst/expressions/PredicateSuite.scala  | 16 +
 .../sql/catalyst/util/SQLOrderingUtilSuite.scala   | 75 ++
 .../org/apache/spark/sql/DataFrameSuite.scala      |  5 ++
 8 files changed, 126 insertions(+), 21 deletions(-)
 copy core/src/main/scala/org/apache/spark/SparkFiles.scala => sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala (57%)
 create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtilSuite.scala


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-2.4 updated: [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe

2020-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new 277ccba  [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe
277ccba is described below

commit 277ccba382ac19debeaf15cd648cf6ab69603012
Author: sychen 
AuthorDate: Tue Sep 8 03:23:59 2020 +

[SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe

### What changes were proposed in this pull request?
This PR increases the thread safety of the `BytesToBytesMap`:
- It makes the `iterator()` and `destructiveIterator()` methods use their own `Location` object. This object used to be shared, which caused issues when the map was iterated over in two threads by two different iterators.
- It removes the `safeIterator()` function, which is no longer needed.
- It improves the documentation of a couple of methods w.r.t. thread-safety.

### Why are the changes needed?
It is unexpected that an iterator shares the object it returns with all other iterators. This violates the iterator contract, and it causes issues when iterators over the same map are consumed in different threads.
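
To make the hazard concrete, here is a minimal, self-contained Scala sketch (toy types assumed for illustration, not the real `BytesToBytesMap` API): when every iterator advances one shared cursor object, two iterators end up consuming the same sequence and each sees only part of it.

```
final class Location { var pos: Int = 0 }   // stand-in for BytesToBytesMap.Location

final class ToyMap(entries: Array[String]) {
  private val sharedLoc = new Location      // shared, like the old `loc` field

  // fresh = false reproduces the old, broken behaviour; fresh = true is the fix.
  def iterator(fresh: Boolean): Iterator[String] = {
    val loc = if (fresh) new Location else sharedLoc
    new Iterator[String] {
      def hasNext: Boolean = loc.pos < entries.length
      def next(): String = { val e = entries(loc.pos); loc.pos += 1; e }
    }
  }
}

object Demo extends App {
  val m = new ToyMap(Array("a", "b", "c", "d"))
  val (i1, i2) = (m.iterator(fresh = false), m.iterator(fresh = false))
  println(i1.next())   // "a"
  println(i2.next())   // "b" -- i2 silently skipped "a" because the cursor is shared
}
```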

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a unit test.

Closes #29605 from cxzl25/SPARK-31511.

Lead-authored-by: sychen 
Co-authored-by: herman 
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/unsafe/map/BytesToBytesMap.java   | 18 +-
 .../sql/execution/joins/HashedRelationSuite.scala  | 39 ++
 2 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
index 2b23fbbf..5ab52cc 100644
--- a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
+++ b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
@@ -421,11 +421,11 @@ public final class BytesToBytesMap extends MemoryConsumer {
    *
    * For efficiency, all calls to `next()` will return the same {@link Location} object.
    *
-   * If any other lookups or operations are performed on this map while iterating over it, including
-   * `lookup()`, the behavior of the returned iterator is undefined.
+   * The returned iterator is thread-safe. However if the map is modified while iterating over it,
+   * the behavior of the returned iterator is undefined.
    */
   public MapIterator iterator() {
-    return new MapIterator(numValues, loc, false);
+    return new MapIterator(numValues, new Location(), false);
   }
 
   /**
@@ -435,19 +435,20 @@ public final class BytesToBytesMap extends MemoryConsumer {
    *
    * For efficiency, all calls to `next()` will return the same {@link Location} object.
    *
-   * If any other lookups or operations are performed on this map while iterating over it, including
-   * `lookup()`, the behavior of the returned iterator is undefined.
+   * The returned iterator is thread-safe. However if the map is modified while iterating over it,
+   * the behavior of the returned iterator is undefined.
    */
   public MapIterator destructiveIterator() {
     updatePeakMemoryUsed();
-    return new MapIterator(numValues, loc, true);
+    return new MapIterator(numValues, new Location(), true);
   }
 
   /**
    * Looks up a key, and return a {@link Location} handle that can be used to test existence
    * and read/write values.
    *
-   * This function always return the same {@link Location} instance to avoid object allocation.
+   * This function always returns the same {@link Location} instance to avoid object allocation.
+   * This function is not thread-safe.
    */
   public Location lookup(Object keyBase, long keyOffset, int keyLength) {
     safeLookup(keyBase, keyOffset, keyLength, loc,
@@ -459,7 +460,8 @@ public final class BytesToBytesMap extends MemoryConsumer {
    * Looks up a key, and return a {@link Location} handle that can be used to test existence
    * and read/write values.
    *
-   * This function always return the same {@link Location} instance to avoid object allocation.
+   * This function always returns the same {@link Location} instance to avoid object allocation.
+   * This function is not thread-safe.
    */
   public Location lookup(Object keyBase, long keyOffset, int keyLength, int hash) {
     safeLookup(keyBase, keyOffset, keyLength, loc, hash);
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
index d9b34dc..1bdd6fe 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
+++

[spark] branch branch-2.4 updated: [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new f2bcc93  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py
f2bcc93 is described below

commit f2bcc9349d86be71dba491b8348ac8d83f0764a8
Author: itholic 
AuthorDate: Tue Sep 8 12:22:13 2020 +0900

[SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py

### What changes were proposed in this pull request?

In certain environments, running the `run-tests.py` script fails as below:

```
Traceback (most recent call last):
 File "", line 1, in 
...

raise RuntimeError('''
RuntimeError:
 An attempt has been made to start a new process before the
 current process has finished its bootstrapping phase.

This probably means that you are not using fork to start your
 child processes and you have forgotten to use the proper idiom
 in the main module:

if __name__ == '__main__':
 freeze_support()
 ...

The "freeze_support()" line can be omitted if the program
 is not going to be frozen to produce an executable.
Traceback (most recent call last):
...
 raise EOFError
EOFError

```

The reason is that `Manager().dict()` launches another process when the main process is initiated.

It works in most environments for an unknown reason, but it is better to avoid such a pattern, as Python itself advises.
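
The pattern of the patch below, distilled into a self-contained script (illustrative names, not Spark's actual run-tests.py): anything that spawns a helper process, such as `multiprocessing.Manager()`, is created only under the `__main__` guard, so merely importing the module never starts a process.

```
from multiprocessing import Manager

SKIPPED_TESTS = None  # plain placeholder at import time; no process is started


def record_skip(key, tests):
    # Callers only run after the guard below has initialized the dict.
    assert SKIPPED_TESTS is not None
    SKIPPED_TESTS[key] = tests


if __name__ == "__main__":
    # Safe: this runs only in the fully bootstrapped main process.
    SKIPPED_TESTS = Manager().dict()
    record_skip(("python3", "pyspark.sql.dataframe"), ["test_skipped"])
    print(dict(SKIPPED_TESTS))
```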

### Why are the changes needed?

To prevent the test failure for Python.

### Does this PR introduce _any_ user-facing change?

No, it fixes a test script.

### How was this patch tested?

Manually ran the script after fixing.

```
Running PySpark tests. Output is in /.../python/unit-tests.log
Will test against the following Python executables: ['/.../python3', 'python3.8']
Will test the following Python tests: ['pyspark.sql.dataframe']
/.../python3 python_implementation is CPython
/.../python3 version is: Python 3.8.5
python3.8 python_implementation is CPython
python3.8 version is: Python 3.8.5
Starting test(/.../python3): pyspark.sql.dataframe
Starting test(python3.8): pyspark.sql.dataframe
Finished test(/.../python3): pyspark.sql.dataframe (33s)
Finished test(python3.8): pyspark.sql.dataframe (34s)
Tests passed in 34 seconds
```

Closes #29666 from itholic/SPARK-32812.

Authored-by: itholic 
Signed-off-by: HyukjinKwon 
(cherry picked from commit c8c082ce380b2357623511c6625503fb3f1d65bf)
Signed-off-by: HyukjinKwon 
---
 python/run-tests.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index c34e48a..9a95c96 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -53,7 +53,7 @@ def print_red(text):
     print('\033[31m' + text + '\033[0m')
 
 
-SKIPPED_TESTS = Manager().dict()
+SKIPPED_TESTS = None
 LOG_FILE = os.path.join(SPARK_HOME, "python/unit-tests.log")
 FAILURE_REPORTING_LOCK = Lock()
 LOGGER = logging.getLogger()
@@ -141,6 +141,7 @@ def run_individual_python_test(target_dir, test_name, pyspark_python):
         skipped_counts = len(skipped_tests)
         if skipped_counts > 0:
             key = (pyspark_python, test_name)
+            assert SKIPPED_TESTS is not None
             SKIPPED_TESTS[key] = skipped_tests
         per_test_output.close()
     except:
@@ -293,4 +294,5 @@ def main():
 
 
 if __name__ == "__main__":
+    SKIPPED_TESTS = Manager().dict()
     main()


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.0 updated (a22e1c5 -> 6b47abd)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.


from a22e1c5  [SPARK-32810][SQL][3.0] CSV/JSON data sources should avoid globbing paths when inferring schema
 add 6b47abd  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py

No new revisions were added by this update.

Summary of changes:
 python/run-tests.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (c336ae3 -> c8c082c)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c336ae3  [SPARK-32186][DOCS][PYTHON] Development - Debugging
 add c8c082c  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py

No new revisions were added by this update.

Summary of changes:
 python/run-tests.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-2.4 updated: [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe

2020-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new 277ccba  [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators 
thread-safe
277ccba is described below

commit 277ccba382ac19debeaf15cd648cf6ab69603012
Author: sychen 
AuthorDate: Tue Sep 8 03:23:59 2020 +

[SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe

### What changes were proposed in this pull request?
This PR increases the thread safety of the `BytesToBytesMap`:
- It makes the `iterator()` and `destructiveIterator()` methods used their 
own `Location` object. This used to be shared, and this was causing issues when 
the map was being iterated over in two threads by two different iterators.
- Removes the `safeIterator()` function. This is not needed anymore.
- Improves the documentation of a couple of methods w.r.t. thread-safety.

### Why are the changes needed?
It is unexpected an iterator shares the object it is returning with all 
other iterators. This is a violation of the iterator contract, and it causes 
issues with iterators over a map that are consumed in different threads.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
add ut

Closes #29605 from cxzl25/SPARK-31511.

Lead-authored-by: sychen 
Co-authored-by: herman 
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/unsafe/map/BytesToBytesMap.java   | 18 +-
 .../sql/execution/joins/HashedRelationSuite.scala  | 39 ++
 2 files changed, 49 insertions(+), 8 deletions(-)

diff --git 
a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java 
b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
index 2b23fbbf..5ab52cc 100644
--- a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
+++ b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
@@ -421,11 +421,11 @@ public final class BytesToBytesMap extends MemoryConsumer 
{
*
* For efficiency, all calls to `next()` will return the same {@link 
Location} object.
*
-   * If any other lookups or operations are performed on this map while 
iterating over it, including
-   * `lookup()`, the behavior of the returned iterator is undefined.
+   * The returned iterator is thread-safe. However if the map is modified 
while iterating over it,
+   * the behavior of the returned iterator is undefined.
*/
   public MapIterator iterator() {
-return new MapIterator(numValues, loc, false);
+return new MapIterator(numValues, new Location(), false);
   }
 
   /**
@@ -435,19 +435,20 @@ public final class BytesToBytesMap extends MemoryConsumer 
{
*
* For efficiency, all calls to `next()` will return the same {@link 
Location} object.
*
-   * If any other lookups or operations are performed on this map while 
iterating over it, including
-   * `lookup()`, the behavior of the returned iterator is undefined.
+   * The returned iterator is thread-safe. However if the map is modified 
while iterating over it,
+   * the behavior of the returned iterator is undefined.
*/
   public MapIterator destructiveIterator() {
 updatePeakMemoryUsed();
-return new MapIterator(numValues, loc, true);
+return new MapIterator(numValues, new Location(), true);
   }
 
   /**
* Looks up a key, and return a {@link Location} handle that can be used to 
test existence
* and read/write values.
*
-   * This function always return the same {@link Location} instance to avoid 
object allocation.
+   * This function always returns the same {@link Location} instance to avoid 
object allocation.
+   * This function is not thread-safe.
*/
   public Location lookup(Object keyBase, long keyOffset, int keyLength) {
 safeLookup(keyBase, keyOffset, keyLength, loc,
@@ -459,7 +460,8 @@ public final class BytesToBytesMap extends MemoryConsumer {
* Looks up a key, and return a {@link Location} handle that can be used to 
test existence
* and read/write values.
*
-   * This function always return the same {@link Location} instance to avoid 
object allocation.
+   * This function always returns the same {@link Location} instance to avoid 
object allocation.
+   * This function is not thread-safe.
*/
   public Location lookup(Object keyBase, long keyOffset, int keyLength, int 
hash) {
 safeLookup(keyBase, keyOffset, keyLength, loc, hash);
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
index d9b34dc..1bdd6fe 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
+++ 

[spark] branch branch-2.4 updated: [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new f2bcc93  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process 
during the main process for run-tests.py
f2bcc93 is described below

commit f2bcc9349d86be71dba491b8348ac8d83f0764a8
Author: itholic 
AuthorDate: Tue Sep 8 12:22:13 2020 +0900

[SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main 
process for run-tests.py

### What changes were proposed in this pull request?

In certain environments, seems it fails to run `run-tests.py` script as 
below:

```
Traceback (most recent call last):
 File "", line 1, in 
...

raise RuntimeError('''
RuntimeError:
 An attempt has been made to start a new process before the
 current process has finished its bootstrapping phase.

This probably means that you are not using fork to start your
 child processes and you have forgotten to use the proper idiom
 in the main module:

if __name__ == '__main__':
 freeze_support()
 ...

The "freeze_support()" line can be omitted if the program
 is not going to be frozen to produce an executable.
Traceback (most recent call last):
...
 raise EOFError
EOFError

```

The reason is that `Manager.dict()` launches another process when the main 
process is initiated.

It works in most environments for an unknown reason but it should be good 
to avoid such pattern as guided from Python itself.

### Why are the changes needed?

To prevent the test failure for Python.

### Does this PR introduce _any_ user-facing change?

No, it fixes a test script.

### How was this patch tested?

Manually ran the script after fixing.

```
Running PySpark tests. Output is in /.../python/unit-tests.log
Will test against the following Python executables: ['/.../python3', 
'python3.8']
Will test the following Python tests: ['pyspark.sql.dataframe']
/.../python3 python_implementation is CPython
/.../python3 version is: Python 3.8.5
python3.8 python_implementation is CPython
python3.8 version is: Python 3.8.5
Starting test(/.../python3): pyspark.sql.dataframe
Starting test(python3.8): pyspark.sql.dataframe
Finished test(/.../python3): pyspark.sql.dataframe (33s)
Finished test(python3.8): pyspark.sql.dataframe (34s)
Tests passed in 34 seconds
```

Closes #29666 from itholic/SPARK-32812.

Authored-by: itholic 
Signed-off-by: HyukjinKwon 
(cherry picked from commit c8c082ce380b2357623511c6625503fb3f1d65bf)
Signed-off-by: HyukjinKwon 
---
 python/run-tests.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index c34e48a..9a95c96 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -53,7 +53,7 @@ def print_red(text):
 print('\033[31m' + text + '\033[0m')
 
 
-SKIPPED_TESTS = Manager().dict()
+SKIPPED_TESTS = None
 LOG_FILE = os.path.join(SPARK_HOME, "python/unit-tests.log")
 FAILURE_REPORTING_LOCK = Lock()
 LOGGER = logging.getLogger()
@@ -141,6 +141,7 @@ def run_individual_python_test(target_dir, test_name, 
pyspark_python):
 skipped_counts = len(skipped_tests)
 if skipped_counts > 0:
 key = (pyspark_python, test_name)
+assert SKIPPED_TESTS is not None
 SKIPPED_TESTS[key] = skipped_tests
 per_test_output.close()
 except:
@@ -293,4 +294,5 @@ def main():
 
 
 if __name__ == "__main__":
+SKIPPED_TESTS = Manager().dict()
 main()


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.0 updated (a22e1c5 -> 6b47abd)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.


from a22e1c5  [SPARK-32810][SQL][3.0] CSV/JSON data sources should avoid 
globbing paths when inferring schema
 add 6b47abd  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process 
during the main process for run-tests.py

No new revisions were added by this update.

Summary of changes:
 python/run-tests.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (c336ae3 -> c8c082c)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c336ae3  [SPARK-32186][DOCS][PYTHON] Development - Debugging
 add c8c082c  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process 
during the main process for run-tests.py

No new revisions were added by this update.

Summary of changes:
 python/run-tests.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-2.4 updated: [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe

2020-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new 277ccba  [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators 
thread-safe
277ccba is described below

commit 277ccba382ac19debeaf15cd648cf6ab69603012
Author: sychen 
AuthorDate: Tue Sep 8 03:23:59 2020 +

[SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe

### What changes were proposed in this pull request?
This PR increases the thread safety of the `BytesToBytesMap`:
- It makes the `iterator()` and `destructiveIterator()` methods use their
own `Location` object. This used to be shared, which caused issues when
the map was iterated over by two different iterators in two threads.
- Removes the `safeIterator()` function. This is not needed anymore.
- Improves the documentation of a couple of methods w.r.t. thread-safety.

### Why are the changes needed?
It is unexpected that an iterator shares the object it returns with all
other iterators. This violates the iterator contract, and it causes
issues when iterators over the same map are consumed in different threads.
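
To illustrate the hazard, here is a minimal Python sketch (the class and
field names are invented for the example; the real code is the Java
`BytesToBytesMap`): a map whose iterators all hand out one shared cursor
object breaks as soon as two iterators are interleaved.

```python
class Cursor:
    """Stands in for Location: a mutable view into the current entry."""
    def __init__(self):
        self.pos = None
        self.value = None

class SharedCursorMap:
    def __init__(self, items):
        self._items = items
        self._loc = Cursor()  # one cursor shared by every iterator (the bug)

    def iterator(self, shared=True):
        # The fix is `shared=False`: allocate a fresh Cursor per iterator.
        loc = self._loc if shared else Cursor()
        for i, item in enumerate(self._items):
            loc.pos, loc.value = i, item
            yield loc

m = SharedCursorMap(["a", "b", "c"])
it1, it2 = m.iterator(), m.iterator()
next(it1)          # it1 -> "a"
cur2 = next(it2)   # it2 -> "a", but it is the same Cursor object
print(cur2.value)  # "a"
next(it1)          # advancing it1 mutates it2's cursor too
print(cur2.value)  # "b" -- the corrupted shared view described above
```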

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a unit test.

Closes #29605 from cxzl25/SPARK-31511.

Lead-authored-by: sychen 
Co-authored-by: herman 
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/unsafe/map/BytesToBytesMap.java   | 18 +-
 .../sql/execution/joins/HashedRelationSuite.scala  | 39 ++
 2 files changed, 49 insertions(+), 8 deletions(-)

diff --git 
a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java 
b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
index 2b23fbbf..5ab52cc 100644
--- a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
+++ b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
@@ -421,11 +421,11 @@ public final class BytesToBytesMap extends MemoryConsumer 
{
*
* For efficiency, all calls to `next()` will return the same {@link 
Location} object.
*
-   * If any other lookups or operations are performed on this map while 
iterating over it, including
-   * `lookup()`, the behavior of the returned iterator is undefined.
+   * The returned iterator is thread-safe. However if the map is modified 
while iterating over it,
+   * the behavior of the returned iterator is undefined.
*/
   public MapIterator iterator() {
-return new MapIterator(numValues, loc, false);
+return new MapIterator(numValues, new Location(), false);
   }
 
   /**
@@ -435,19 +435,20 @@ public final class BytesToBytesMap extends MemoryConsumer 
{
*
* For efficiency, all calls to `next()` will return the same {@link 
Location} object.
*
-   * If any other lookups or operations are performed on this map while 
iterating over it, including
-   * `lookup()`, the behavior of the returned iterator is undefined.
+   * The returned iterator is thread-safe. However if the map is modified 
while iterating over it,
+   * the behavior of the returned iterator is undefined.
*/
   public MapIterator destructiveIterator() {
 updatePeakMemoryUsed();
-return new MapIterator(numValues, loc, true);
+return new MapIterator(numValues, new Location(), true);
   }
 
   /**
* Looks up a key, and return a {@link Location} handle that can be used to 
test existence
* and read/write values.
*
-   * This function always return the same {@link Location} instance to avoid 
object allocation.
+   * This function always returns the same {@link Location} instance to avoid 
object allocation.
+   * This function is not thread-safe.
*/
   public Location lookup(Object keyBase, long keyOffset, int keyLength) {
 safeLookup(keyBase, keyOffset, keyLength, loc,
@@ -459,7 +460,8 @@ public final class BytesToBytesMap extends MemoryConsumer {
* Looks up a key, and return a {@link Location} handle that can be used to 
test existence
* and read/write values.
*
-   * This function always return the same {@link Location} instance to avoid 
object allocation.
+   * This function always returns the same {@link Location} instance to avoid 
object allocation.
+   * This function is not thread-safe.
*/
   public Location lookup(Object keyBase, long keyOffset, int keyLength, int 
hash) {
 safeLookup(keyBase, keyOffset, keyLength, loc, hash);
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
index d9b34dc..1bdd6fe 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
+++ 

[spark] branch branch-2.4 updated: [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new f2bcc93  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process 
during the main process for run-tests.py
f2bcc93 is described below

commit f2bcc9349d86be71dba491b8348ac8d83f0764a8
Author: itholic 
AuthorDate: Tue Sep 8 12:22:13 2020 +0900

[SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main 
process for run-tests.py

### What changes were proposed in this pull request?

In certain environments, the `run-tests.py` script fails to run as
below:

```
Traceback (most recent call last):
 File "", line 1, in 
...

raise RuntimeError('''
RuntimeError:
 An attempt has been made to start a new process before the
 current process has finished its bootstrapping phase.

This probably means that you are not using fork to start your
 child processes and you have forgotten to use the proper idiom
 in the main module:

if __name__ == '__main__':
 freeze_support()
 ...

The "freeze_support()" line can be omitted if the program
 is not going to be frozen to produce an executable.
Traceback (most recent call last):
...
 raise EOFError
EOFError

```

The reason is that `Manager().dict()` launches another process while the main
process is still being initialized.

It happens to work in most environments, but it is better to avoid this
pattern, as Python's own documentation advises.
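
For reference, the guarded idiom looks like the following minimal,
standalone sketch (names are illustrative, not the actual run-tests.py
code):

```python
from multiprocessing import Manager

SKIPPED_TESTS = None  # created by the main process only


def record_skipped(key, tests):
    # Callers may read and write the proxy dict, but must never create it.
    assert SKIPPED_TESTS is not None
    SKIPPED_TESTS[key] = tests


if __name__ == "__main__":
    # Manager() starts a helper process, so it must run only after the
    # main module has finished bootstrapping.
    SKIPPED_TESTS = Manager().dict()
    record_skipped(("python3", "pyspark.sql.dataframe"), ["test_example"])
    print(dict(SKIPPED_TESTS))
```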

### Why are the changes needed?

To prevent the Python test script from failing in such environments.

### Does this PR introduce _any_ user-facing change?

No, it fixes a test script.

### How was this patch tested?

Manually ran the script after applying the fix.

```
Running PySpark tests. Output is in /.../python/unit-tests.log
Will test against the following Python executables: ['/.../python3', 
'python3.8']
Will test the following Python tests: ['pyspark.sql.dataframe']
/.../python3 python_implementation is CPython
/.../python3 version is: Python 3.8.5
python3.8 python_implementation is CPython
python3.8 version is: Python 3.8.5
Starting test(/.../python3): pyspark.sql.dataframe
Starting test(python3.8): pyspark.sql.dataframe
Finished test(/.../python3): pyspark.sql.dataframe (33s)
Finished test(python3.8): pyspark.sql.dataframe (34s)
Tests passed in 34 seconds
```

Closes #29666 from itholic/SPARK-32812.

Authored-by: itholic 
Signed-off-by: HyukjinKwon 
(cherry picked from commit c8c082ce380b2357623511c6625503fb3f1d65bf)
Signed-off-by: HyukjinKwon 
---
 python/run-tests.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index c34e48a..9a95c96 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -53,7 +53,7 @@ def print_red(text):
 print('\033[31m' + text + '\033[0m')
 
 
-SKIPPED_TESTS = Manager().dict()
+SKIPPED_TESTS = None
 LOG_FILE = os.path.join(SPARK_HOME, "python/unit-tests.log")
 FAILURE_REPORTING_LOCK = Lock()
 LOGGER = logging.getLogger()
@@ -141,6 +141,7 @@ def run_individual_python_test(target_dir, test_name, 
pyspark_python):
 skipped_counts = len(skipped_tests)
 if skipped_counts > 0:
 key = (pyspark_python, test_name)
+assert SKIPPED_TESTS is not None
 SKIPPED_TESTS[key] = skipped_tests
 per_test_output.close()
 except:
@@ -293,4 +294,5 @@ def main():
 
 
 if __name__ == "__main__":
+SKIPPED_TESTS = Manager().dict()
 main()


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (117a6f1 -> c336ae3)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 117a6f1  [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods 
to QueryPlan
 add c336ae3  [SPARK-32186][DOCS][PYTHON] Development - Debugging

No new revisions were added by this update.

Summary of changes:
 docs/img/pyspark-remote-debug1.png   | Bin 0 -> 91214 bytes
 docs/img/pyspark-remote-debug2.png   | Bin 0 -> 10058 bytes
 python/docs/source/development/debugging.rst | 280 +++
 python/docs/source/development/index.rst |   1 +
 4 files changed, 281 insertions(+)
 create mode 100644 docs/img/pyspark-remote-debug1.png
 create mode 100644 docs/img/pyspark-remote-debug2.png
 create mode 100644 python/docs/source/development/debugging.rst


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (954cd9f -> 117a6f1)

2020-09-07 Thread yamamuro
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 954cd9f  [SPARK-32810][SQL] CSV/JSON data sources should avoid 
globbing paths when inferring schema
 add 117a6f1  [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods 
to QueryPlan

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/analysis/Analyzer.scala | 130 ++---
 .../spark/sql/catalyst/analysis/TypeCoercion.scala |  58 -
 .../spark/sql/catalyst/plans/QueryPlan.scala   |  85 ++
 .../catalyst/plans/logical/AnalysisHelper.scala|  15 ++-
 .../sql/catalyst/analysis/TypeCoercionSuite.scala  |   9 +-
 5 files changed, 136 insertions(+), 161 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.0 updated (c2c7c9e -> a22e1c5)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c2c7c9e  [SPARK-32779][SQL] Avoid using synchronized API of 
SessionCatalog in withClient flow, this leads to DeadLock
 add a22e1c5  [SPARK-32810][SQL][3.0] CSV/JSON data sources should avoid 
globbing paths when inferring schema

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/datasources/DataSource.scala | 32 ++
 .../execution/datasources/csv/CSVDataSource.scala  |  2 +-
 .../datasources/json/JsonDataSource.scala  |  2 +-
 .../sql/execution/datasources/v2/FileTable.scala   | 10 ++-
 .../sql/test/DataFrameReaderWriterSuite.scala  | 23 
 5 files changed, 60 insertions(+), 9 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-2.4 updated: [SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new bc471f3  [SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid 
globbing paths when inferring schema
bc471f3 is described below

commit bc471f3ec76618409fc9cc5791ddd2e3beca9c5c
Author: Max Gekk 
AuthorDate: Tue Sep 8 09:45:17 2020 +0900

[SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid globbing paths 
when inferring schema

### What changes were proposed in this pull request?
In the PR, I propose to fix an issue with the CSV and JSON data sources in 
Spark SQL when both of the following are true:
* no user-specified schema is provided
* some file paths contain escaped glob metacharacters, such as `[`, `]`,
`{`, `}`, `*`, etc.

### Why are the changes needed?
To fix the issue where the following two queries try to read from the paths
`[abc].csv` and `[abc].json`:
```scala
spark.read.csv("""/tmp/\[abc\].csv""").show
spark.read.json("""/tmp/\[abc\].json""").show
```
but would end up hitting an exception:
```
org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/tmp/[abc].csv;
  at 
org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:722)
  at 
scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:392)
```
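
The PySpark equivalent of the repro, as a sketch (assumes a local session;
the path is illustrative): the brackets are part of the file name, so they
must be escaped to stop the reader from treating the path as a glob.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Write a dataset whose output path contains glob metacharacters.
spark.range(3).write.mode("overwrite").csv("/tmp/[abc].csv")

# Escape the metacharacters so schema inference treats the path literally;
# without the fix, this read failed with "Path does not exist".
spark.read.csv("/tmp/\\[abc\\].csv").show()
```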

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added new test cases in `DataFrameReaderWriterSuite`.

Closes #29663 from MaxGekk/globbing-paths-when-inferring-schema-2.4.

Authored-by: Max Gekk 
Signed-off-by: HyukjinKwon 
---
 .../sql/execution/datasources/DataSource.scala | 27 ++
 .../execution/datasources/csv/CSVDataSource.scala  |  2 +-
 .../datasources/json/JsonDataSource.scala  |  2 +-
 .../sql/test/DataFrameReaderWriterSuite.scala  | 23 ++
 4 files changed, 48 insertions(+), 6 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
index ff5fe09..31c91f7 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
@@ -65,8 +65,9 @@ import org.apache.spark.util.Utils
  * metadata.  For example, when reading a partitioned table from a file 
system, partition columns
  * will be inferred from the directory layout even if they are not specified.
  *
- * @param paths A list of file system paths that hold data.  These will be 
globbed before and
- *  qualified. This option only works when reading from a 
[[FileFormat]].
+ * @param paths A list of file system paths that hold data. These will be 
globbed before if
+ *  the "__globPaths__" option is true, and will be qualified. 
This option only works
+ *  when reading from a [[FileFormat]].
  * @param userSpecifiedSchema An optional specification of the schema of the 
data. When present
  *we skip attempting to infer the schema.
  * @param partitionColumns A list of column names that the relation is 
partitioned by. This list is
@@ -97,6 +98,15 @@ case class DataSource(
   private val caseInsensitiveOptions = CaseInsensitiveMap(options)
   private val equality = sparkSession.sessionState.conf.resolver
 
+  /**
+   * Whether or not paths should be globbed before being used to access files.
+   */
+  def globPaths: Boolean = {
+options.get(DataSource.GLOB_PATHS_KEY)
+  .map(_ == "true")
+  .getOrElse(true)
+  }
+
   bucketSpec.map { bucket =>
 SchemaUtils.checkColumnNameDuplication(
   bucket.bucketColumnNames, "in the bucket definition", equality)
@@ -223,7 +233,7 @@ case class DataSource(
 // For glob pattern, we do not check it because the glob pattern might 
only make sense
 // once the streaming job starts and some upstream source starts 
dropping data.
 val hdfsPath = new Path(path)
-if (!SparkHadoopUtil.get.isGlobPath(hdfsPath)) {
+if (!globPaths || !SparkHadoopUtil.get.isGlobPath(hdfsPath)) {
   val fs = hdfsPath.getFileSystem(newHadoopConfiguration())
   if (!fs.exists(hdfsPath)) {
 throw new AnalysisException(s"Path does not exist: $path")
@@ -550,7 +560,11 @@ case class DataSource(
   val hdfsPath = new Path(path)
   val fs = hdfsPath.getFileSystem(hadoopConf)
   val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
-  val globPath = SparkHadoopUtil.get.globPathIfNecessary(fs, 

[spark] branch master updated (8bd3770 -> 954cd9f)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 8bd3770  [SPARK-32798][PYTHON] Make unionByName optionally fill 
missing columns with nulls in PySpark
 add 954cd9f  [SPARK-32810][SQL] CSV/JSON data sources should avoid 
globbing paths when inferring schema

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/datasources/DataSource.scala | 32 ++
 .../execution/datasources/csv/CSVDataSource.scala  |  2 +-
 .../datasources/json/JsonDataSource.scala  |  2 +-
 .../sql/execution/datasources/v2/FileTable.scala   | 10 ++-
 .../execution/datasources/DataSourceSuite.scala| 18 
 .../sql/test/DataFrameReaderWriterSuite.scala  | 23 
 6 files changed, 72 insertions(+), 15 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (c43460c -> 8bd3770)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c43460c  [SPARK-32753][SQL] Only copy tags to node with no tags
 add 8bd3770  [SPARK-32798][PYTHON] Make unionByName optionally fill 
missing columns with nulls in PySpark

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/dataframe.py | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)
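
For context, a quick sketch of the resulting PySpark behavior (assuming the
new flag mirrors the Scala API's `allowMissingColumns` parameter):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

df1 = spark.createDataFrame([(1, 2)], ["a", "b"])
df2 = spark.createDataFrame([(3, 4)], ["b", "c"])

# Columns missing on either side are filled with nulls instead of raising
# an error: the result has columns a, b, c with rows (1, 2, null) and
# (null, 3, 4).
df1.unionByName(df2, allowMissingColumns=True).show()
```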


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (954cd9f -> 117a6f1)

2020-09-07 Thread yamamuro
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 954cd9f  [SPARK-32810][SQL] CSV/JSON data sources should avoid 
globbing paths when inferring schema
 add 117a6f1  [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods 
to QueryPlan

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/analysis/Analyzer.scala | 130 ++---
 .../spark/sql/catalyst/analysis/TypeCoercion.scala |  58 -
 .../spark/sql/catalyst/plans/QueryPlan.scala   |  85 ++
 .../catalyst/plans/logical/AnalysisHelper.scala|  15 ++-
 .../sql/catalyst/analysis/TypeCoercionSuite.scala  |   9 +-
 5 files changed, 136 insertions(+), 161 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.0 updated (c2c7c9e -> a22e1c5)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c2c7c9e  [SPARK-32779][SQL] Avoid using synchronized API of 
SessionCatalog in withClient flow, this leads to DeadLock
 add a22e1c5  [SPARK-32810][SQL][3.0] CSV/JSON data sources should avoid 
globbing paths when inferring schema

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/datasources/DataSource.scala | 32 ++
 .../execution/datasources/csv/CSVDataSource.scala  |  2 +-
 .../datasources/json/JsonDataSource.scala  |  2 +-
 .../sql/execution/datasources/v2/FileTable.scala   | 10 ++-
 .../sql/test/DataFrameReaderWriterSuite.scala  | 23 
 5 files changed, 60 insertions(+), 9 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-2.4 updated: [SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new bc471f3  [SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid 
globbing paths when inferring schema
bc471f3 is described below

commit bc471f3ec76618409fc9cc5791ddd2e3beca9c5c
Author: Max Gekk 
AuthorDate: Tue Sep 8 09:45:17 2020 +0900

[SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid globbing paths 
when inferring schema

### What changes were proposed in this pull request?
In the PR, I propose to fix an issue with the CSV and JSON data sources in 
Spark SQL when both of the following are true:
* no user specified schema
* some file paths contain escaped glob metacharacters, such as `[``]`, 
`{``}`, `*` etc.

### Why are the changes needed?
To fix the issue when the follow two queries try to read from paths 
`[abc].csv` and `[abc].json`:
```scala
spark.read.csv("""/tmp/\[abc\].csv""").show
spark.read.json("""/tmp/\[abc\].json""").show
```
but would end up hitting an exception:
```
org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/tmp/[abc].csv;
  at 
org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:722)
  at 
scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:392)
```

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added new test cases in `DataFrameReaderWriterSuite`.

Closes #29663 from MaxGekk/globbing-paths-when-inferring-schema-2.4.

Authored-by: Max Gekk 
Signed-off-by: HyukjinKwon 
---
 .../sql/execution/datasources/DataSource.scala | 27 ++
 .../execution/datasources/csv/CSVDataSource.scala  |  2 +-
 .../datasources/json/JsonDataSource.scala  |  2 +-
 .../sql/test/DataFrameReaderWriterSuite.scala  | 23 ++
 4 files changed, 48 insertions(+), 6 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
index ff5fe09..31c91f7 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
@@ -65,8 +65,9 @@ import org.apache.spark.util.Utils
  * metadata.  For example, when reading a partitioned table from a file 
system, partition columns
  * will be inferred from the directory layout even if they are not specified.
  *
- * @param paths A list of file system paths that hold data.  These will be 
globbed before and
- *  qualified. This option only works when reading from a 
[[FileFormat]].
+ * @param paths A list of file system paths that hold data. These will be 
globbed before if
+ *  the "__globPaths__" option is true, and will be qualified. 
This option only works
+ *  when reading from a [[FileFormat]].
  * @param userSpecifiedSchema An optional specification of the schema of the 
data. When present
  *we skip attempting to infer the schema.
  * @param partitionColumns A list of column names that the relation is 
partitioned by. This list is
@@ -97,6 +98,15 @@ case class DataSource(
   private val caseInsensitiveOptions = CaseInsensitiveMap(options)
   private val equality = sparkSession.sessionState.conf.resolver
 
+  /**
+   * Whether or not paths should be globbed before being used to access files.
+   */
+  def globPaths: Boolean = {
+options.get(DataSource.GLOB_PATHS_KEY)
+  .map(_ == "true")
+  .getOrElse(true)
+  }
+
   bucketSpec.map { bucket =>
 SchemaUtils.checkColumnNameDuplication(
   bucket.bucketColumnNames, "in the bucket definition", equality)
@@ -223,7 +233,7 @@ case class DataSource(
 // For glob pattern, we do not check it because the glob pattern might 
only make sense
 // once the streaming job starts and some upstream source starts 
dropping data.
 val hdfsPath = new Path(path)
-if (!SparkHadoopUtil.get.isGlobPath(hdfsPath)) {
+if (!globPaths || !SparkHadoopUtil.get.isGlobPath(hdfsPath)) {
   val fs = hdfsPath.getFileSystem(newHadoopConfiguration())
   if (!fs.exists(hdfsPath)) {
 throw new AnalysisException(s"Path does not exist: $path")
@@ -550,7 +560,11 @@ case class DataSource(
   val hdfsPath = new Path(path)
   val fs = hdfsPath.getFileSystem(hadoopConf)
   val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
-  val globPath = SparkHadoopUtil.get.globPathIfNecessary(fs, 

[spark] branch master updated (8bd3770 -> 954cd9f)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 8bd3770  [SPARK-32798][PYTHON] Make unionByName optionally fill 
missing columns with nulls in PySpark
 add 954cd9f  [SPARK-32810][SQL] CSV/JSON data sources should avoid 
globbing paths when inferring schema

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/datasources/DataSource.scala | 32 ++
 .../execution/datasources/csv/CSVDataSource.scala  |  2 +-
 .../datasources/json/JsonDataSource.scala  |  2 +-
 .../sql/execution/datasources/v2/FileTable.scala   | 10 ++-
 .../execution/datasources/DataSourceSuite.scala| 18 
 .../sql/test/DataFrameReaderWriterSuite.scala  | 23 
 6 files changed, 72 insertions(+), 15 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (c43460c -> 8bd3770)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c43460c  [SPARK-32753][SQL] Only copy tags to node with no tags
 add 8bd3770  [SPARK-32798][PYTHON] Make unionByName optionally fill 
missing columns with nulls in PySpark

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/dataframe.py | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.0 updated (c2c7c9e -> a22e1c5)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c2c7c9e  [SPARK-32779][SQL] Avoid using synchronized API of 
SessionCatalog in withClient flow, this leads to DeadLock
 add a22e1c5  [SPARK-32810][SQL][3.0] CSV/JSON data sources should avoid 
globbing paths when inferring schema

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/datasources/DataSource.scala | 32 ++
 .../execution/datasources/csv/CSVDataSource.scala  |  2 +-
 .../datasources/json/JsonDataSource.scala  |  2 +-
 .../sql/execution/datasources/v2/FileTable.scala   | 10 ++-
 .../sql/test/DataFrameReaderWriterSuite.scala  | 23 
 5 files changed, 60 insertions(+), 9 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-2.4 updated: [SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new bc471f3  [SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid 
globbing paths when inferring schema
bc471f3 is described below

commit bc471f3ec76618409fc9cc5791ddd2e3beca9c5c
Author: Max Gekk 
AuthorDate: Tue Sep 8 09:45:17 2020 +0900

[SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid globbing paths 
when inferring schema

### What changes were proposed in this pull request?
In the PR, I propose to fix an issue with the CSV and JSON data sources in 
Spark SQL when both of the following are true:
* no user specified schema
* some file paths contain escaped glob metacharacters, such as `[``]`, 
`{``}`, `*` etc.

### Why are the changes needed?
To fix the issue when the follow two queries try to read from paths 
`[abc].csv` and `[abc].json`:
```scala
spark.read.csv("""/tmp/\[abc\].csv""").show
spark.read.json("""/tmp/\[abc\].json""").show
```
but would end up hitting an exception:
```
org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/tmp/[abc].csv;
  at 
org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:722)
  at 
scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:392)
```

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added new test cases in `DataFrameReaderWriterSuite`.

Closes #29663 from MaxGekk/globbing-paths-when-inferring-schema-2.4.

Authored-by: Max Gekk 
Signed-off-by: HyukjinKwon 
---
 .../sql/execution/datasources/DataSource.scala | 27 ++
 .../execution/datasources/csv/CSVDataSource.scala  |  2 +-
 .../datasources/json/JsonDataSource.scala  |  2 +-
 .../sql/test/DataFrameReaderWriterSuite.scala  | 23 ++
 4 files changed, 48 insertions(+), 6 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
index ff5fe09..31c91f7 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
@@ -65,8 +65,9 @@ import org.apache.spark.util.Utils
  * metadata.  For example, when reading a partitioned table from a file 
system, partition columns
  * will be inferred from the directory layout even if they are not specified.
  *
- * @param paths A list of file system paths that hold data.  These will be 
globbed before and
- *  qualified. This option only works when reading from a 
[[FileFormat]].
+ * @param paths A list of file system paths that hold data. These will be 
globbed before if
+ *  the "__globPaths__" option is true, and will be qualified. 
This option only works
+ *  when reading from a [[FileFormat]].
  * @param userSpecifiedSchema An optional specification of the schema of the 
data. When present
  *we skip attempting to infer the schema.
  * @param partitionColumns A list of column names that the relation is 
partitioned by. This list is
@@ -97,6 +98,15 @@ case class DataSource(
   private val caseInsensitiveOptions = CaseInsensitiveMap(options)
   private val equality = sparkSession.sessionState.conf.resolver
 
+  /**
+   * Whether or not paths should be globbed before being used to access files.
+   */
+  def globPaths: Boolean = {
+options.get(DataSource.GLOB_PATHS_KEY)
+  .map(_ == "true")
+  .getOrElse(true)
+  }
+
   bucketSpec.map { bucket =>
 SchemaUtils.checkColumnNameDuplication(
   bucket.bucketColumnNames, "in the bucket definition", equality)
@@ -223,7 +233,7 @@ case class DataSource(
 // For glob pattern, we do not check it because the glob pattern might 
only make sense
 // once the streaming job starts and some upstream source starts 
dropping data.
 val hdfsPath = new Path(path)
-if (!SparkHadoopUtil.get.isGlobPath(hdfsPath)) {
+if (!globPaths || !SparkHadoopUtil.get.isGlobPath(hdfsPath)) {
   val fs = hdfsPath.getFileSystem(newHadoopConfiguration())
   if (!fs.exists(hdfsPath)) {
 throw new AnalysisException(s"Path does not exist: $path")
@@ -550,7 +560,11 @@ case class DataSource(
   val hdfsPath = new Path(path)
   val fs = hdfsPath.getFileSystem(hadoopConf)
   val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
-  val globPath = SparkHadoopUtil.get.globPathIfNecessary(fs, 
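
The diff is truncated here by the archive, but the gating it introduces is 
small enough to sketch in isolation. Below is a minimal, self-contained Scala 
sketch of the new check, assuming a plain `Map` in place of Spark's 
`CaseInsensitiveMap` and an illustrative object name: paths are globbed by 
default, and the CSV/JSON schema-inference paths pass `false` under the 
internal `__globPaths__` key so that already-resolved file paths are not 
globbed a second time.

```scala
// Sketch only: GlobPathsSketch is not Spark code; it isolates the
// option-gating logic the patch adds to DataSource.
object GlobPathsSketch extends App {
  val GLOB_PATHS_KEY = "__globPaths__" // internal key introduced by the patch

  // Mirrors DataSource.globPaths: glob unless the option is explicitly "false".
  def globPaths(options: Map[String, String]): Boolean =
    options.get(GLOB_PATHS_KEY).map(_ == "true").getOrElse(true)

  println(globPaths(Map.empty))                      // true: default behaviour
  println(globPaths(Map(GLOB_PATHS_KEY -> "false"))) // false: schema inference
}
```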

[spark] branch master updated (8bd3770 -> 954cd9f)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 8bd3770  [SPARK-32798][PYTHON] Make unionByName optionally fill 
missing columns with nulls in PySpark
 add 954cd9f  [SPARK-32810][SQL] CSV/JSON data sources should avoid 
globbing paths when inferring schema

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/datasources/DataSource.scala | 32 ++
 .../execution/datasources/csv/CSVDataSource.scala  |  2 +-
 .../datasources/json/JsonDataSource.scala  |  2 +-
 .../sql/execution/datasources/v2/FileTable.scala   | 10 ++-
 .../execution/datasources/DataSourceSuite.scala| 18 
 .../sql/test/DataFrameReaderWriterSuite.scala  | 23 
 6 files changed, 72 insertions(+), 15 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (c43460c -> 8bd3770)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c43460c  [SPARK-32753][SQL] Only copy tags to node with no tags
 add 8bd3770  [SPARK-32798][PYTHON] Make unionByName optionally fill 
missing columns with nulls in PySpark

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/dataframe.py | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
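
The change above is PySpark-only; it mirrors the `allowMissingColumns` flag 
that the Scala `Dataset.unionByName` gained for Spark 3.1. A hedged usage 
sketch against that Scala API, with invented column names and values:

```scala
import org.apache.spark.sql.SparkSession

// Assumes Spark 3.1+, where unionByName accepts allowMissingColumns.
object UnionByNameSketch extends App {
  val spark = SparkSession.builder()
    .master("local[*]").appName("unionByName").getOrCreate()
  import spark.implicits._

  val left  = Seq((1, "a")).toDF("id", "name")
  val right = Seq((2, 0.5)).toDF("id", "score")

  // Columns missing on either side are added and filled with nulls,
  // yielding the schema (id, name, score).
  left.unionByName(right, allowMissingColumns = true).show()
  spark.stop()
}
```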



[spark] branch master updated (04f7f6da -> c43460c)

2020-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 04f7f6da [SPARK-32748][SQL] Support local property propagation in 
SubqueryBroadcastExec
 add c43460c  [SPARK-32753][SQL] Only copy tags to node with no tags

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/catalyst/trees/TreeNode.scala |  7 ++-
 .../sql/execution/adaptive/AdaptiveQueryExecSuite.scala| 14 ++
 2 files changed, 20 insertions(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-32753][SQL] Only copy tags to node with no tags

2020-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c43460c  [SPARK-32753][SQL] Only copy tags to node with no tags
c43460c is described below

commit c43460cf82a075fd071717489798cde6a61b8515
Author: manuzhang 
AuthorDate: Mon Sep 7 16:08:57 2020 +

[SPARK-32753][SQL] Only copy tags to node with no tags

### What changes were proposed in this pull request?
Only copy tags to node with no tags when transforming plans.

### Why are the changes needed?
cloud-fan [made a good 
point](https://github.com/apache/spark/pull/29593#discussion_r482013121) that 
it doesn't make sense to append tags to existing nodes when nodes are removed. 
That can cause bugs such as duplicate rows when deduplicating and 
repartitioning by the same column with AQE.

```
spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
val df = spark.sql("select id from v1 group by id distribute by id")
println(df.collect().toArray.mkString(","))
println(df.queryExecution.executedPlan)

// With AQE

[4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
AdaptiveSparkPlan(isFinalPlan=true)
+- CustomShuffleReader local
   +- ShuffleQueryStage 0
  +- Exchange hashpartitioning(id#183L, 10), true
 +- *(3) HashAggregate(keys=[id#183L], functions=[], 
output=[id#183L])
+- Union
   :- *(1) Range (0, 10, step=1, splits=2)
   +- *(2) Range (0, 10, step=1, splits=2)

// Without AQE
[4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
*(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
+- Exchange hashpartitioning(id#206L, 10), true
   +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
  +- Union
 :- *(1) Range (0, 10, step=1, splits=2)
 +- *(2) Range (0, 10, step=1, splits=2)
```

It's too expensive to detect node removal, so we compromise by copying tags 
only to nodes that have no tags.

### Does this PR introduce _any_ user-facing change?
Yes. Fix a bug.

### How was this patch tested?
Add test.

Closes #29593 from manuzhang/spark-32753.

Authored-by: manuzhang 
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/catalyst/trees/TreeNode.scala |  7 ++-
 .../sql/execution/adaptive/AdaptiveQueryExecSuite.scala| 14 ++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
index 616572d..8003012 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
@@ -92,7 +92,12 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] 
extends Product {
   private val tags: mutable.Map[TreeNodeTag[_], Any] = mutable.Map.empty
 
   protected def copyTagsFrom(other: BaseType): Unit = {
-tags ++= other.tags
+// SPARK-32753: it only makes sense to copy tags to a new node
+// but it's too expensive to detect other cases likes node removal
+// so we make a compromise here to copy tags to node with no tags
+if (tags.isEmpty) {
+  tags ++= other.tags
+}
   }
 
   def setTagValue[T](tag: TreeNodeTag[T], value: T): Unit = {
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
index f892e66..628bafa 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
@@ -1224,4 +1224,18 @@ class AdaptiveQueryExecSuite
   })
 }
   }
+
+  test("SPARK-32753: Only copy tags to node with no tags") {
+withSQLConf(
+  SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true"
+) {
+  spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
+
+  val (_, adaptivePlan) = runAdaptiveAndVerifyResult(
+"SELECT id FROM v1 GROUP BY id DISTRIBUTE BY id")
+  assert(collect(adaptivePlan) {
+case s: ShuffleExchangeExec => s
+  }.length == 1)
+}
+  }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
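
The guard added to `copyTagsFrom` is self-contained enough to demonstrate 
outside Spark. A minimal sketch, with a simplified node type standing in for 
`TreeNode` and illustrative tag names:

```scala
import scala.collection.mutable

object TagCopySketch extends App {
  class Node {
    val tags: mutable.Map[String, Any] = mutable.Map.empty
    def copyTagsFrom(other: Node): Unit = {
      // Only a freshly created (tag-free) node receives tags; a node that
      // already carries tags is left alone, which is the compromise the
      // commit message describes for undetectable node removals.
      if (tags.isEmpty) tags ++= other.tags
    }
  }

  val original = new Node
  original.tags("planMarker") = "A" // illustrative tag, not a real Spark tag

  val fresh = new Node
  fresh.copyTagsFrom(original)
  println(fresh.tags)  // Map(planMarker -> A): copied onto the fresh node

  val tagged = new Node
  tagged.tags("planMarker") = "B"
  tagged.copyTagsFrom(original)
  println(tagged.tags) // Map(planMarker -> B): existing tags block the copy
}
```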



svn commit: r41342 - /dev/spark/v3.0.1-rc3-docs/

2020-09-07 Thread wenchen
Author: wenchen
Date: Mon Sep  7 15:30:54 2020
New Revision: 41342

Log:
Remove RC artifacts

Removed:
dev/spark/v3.0.1-rc3-docs/


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r41344 - /release/spark/spark-3.0.0/

2020-09-07 Thread wenchen
Author: wenchen
Date: Mon Sep  7 15:30:55 2020
New Revision: 41344

Log:
Remove old release

Removed:
release/spark/spark-3.0.0/


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r41343 - /dev/spark/v3.0.1-rc3-bin/ /release/spark/spark-3.0.1/

2020-09-07 Thread wenchen
Author: wenchen
Date: Mon Sep  7 15:30:55 2020
New Revision: 41343

Log:
Apache Spark 3.0.1

Added:
release/spark/spark-3.0.1/
  - copied from r41342, dev/spark/v3.0.1-rc3-bin/
Removed:
dev/spark/v3.0.1-rc3-bin/


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] tag v3.0.1 created (now 2b147c4)

2020-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to tag v3.0.1
in repository https://gitbox.apache.org/repos/asf/spark.git.


  at 2b147c4  (commit)
No new revisions were added by this update.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (b0322bf -> 04f7f6da)

2020-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from b0322bf  [SPARK-32779][SQL] Avoid using synchronized API of 
SessionCatalog in withClient flow, this leads to DeadLock
 add 04f7f6da [SPARK-32748][SQL] Support local property propagation in 
SubqueryBroadcastExec

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/SubqueryBroadcastExec.scala  | 16 --
 .../sql/internal/ExecutorSideSQLConfSuite.scala| 63 +-
 2 files changed, 72 insertions(+), 7 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.0 updated: [SPARK-32779][SQL] Avoid using synchronized API of SessionCatalog in withClient flow, this leads to DeadLock

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new c2c7c9e  [SPARK-32779][SQL] Avoid using synchronized API of 
SessionCatalog in withClient flow, this leads to DeadLock
c2c7c9e is described below

commit c2c7c9ef78441682a585abb1dede9b668802a224
Author: sandeep.katta 
AuthorDate: Mon Sep 7 15:10:33 2020 +0900

[SPARK-32779][SQL] Avoid using synchronized API of SessionCatalog in 
withClient flow, this leads to DeadLock

### What changes were proposed in this pull request?

There is no need to use the database name in the `loadPartition` API of 
`Shim_v3_0` to get the Hive table; Hive provides an overloaded method that 
returns the table from the table name alone. Using that overload removes the 
Shim layer's dependency on `SessionCatalog`.

### Why are the changes needed?
To avoid deadlock when communicating with Hive metastore 3.1.x
```
Found one Java-level deadlock:
=
"worker3":
  waiting to lock monitor 0x7faf0be602b8 (object 0x0007858f85f0, a 
org.apache.spark.sql.hive.HiveSessionCatalog),
  which is held by "worker0"
"worker0":
  waiting to lock monitor 0x7faf0be5fc88 (object 0x000785c15c80, a 
org.apache.spark.sql.hive.HiveExternalCatalog),
  which is held by "worker3"

Java stack information for the threads listed above:
===
"worker3":
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.getCurrentDatabase(SessionCatalog.scala:256)
  - waiting to lock <0x0007858f85f0> (a 
org.apache.spark.sql.hive.HiveSessionCatalog)
  at 
org.apache.spark.sql.hive.client.Shim_v3_0.loadPartition(HiveShim.scala:1332)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$loadPartition$1(HiveClientImpl.scala:870)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl$$Lambda$4459/1387095575.apply$mcV$sp(Unknown
 Source)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl$$Lambda$2227/313239499.apply(Unknown
 Source)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
  - locked <0x000785ef9d78> (a 
org.apache.spark.sql.hive.client.IsolatedClientLoader)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.loadPartition(HiveClientImpl.scala:860)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$loadPartition$1(HiveExternalCatalog.scala:911)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$Lambda$4457/2037578495.apply$mcV$sp(Unknown
 Source)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
  - locked <0x000785c15c80> (a 
org.apache.spark.sql.hive.HiveExternalCatalog)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:890)
  at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadPartition(SessionCatalog.scala:512)
  at 
org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:383)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  - locked <0x0007b1690ff8> (a 
org.apache.spark.sql.execution.command.ExecutedCommandExec)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$$Lambda$2084/428667685.apply(Unknown 
Source)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
  at org.apache.spark.sql.Dataset$$Lambda$2085/559530590.apply(Unknown 
Source)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at 
org.apache.spark.sql.execution.SQLExecution$$$Lambda$2093/139449177.apply(Unknown
 Source)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
   

[spark] branch master updated (05fcf26 -> b0322bf)

2020-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 05fcf26  [SPARK-32677][SQL] Load function resource before create
 add b0322bf  [SPARK-32779][SQL] Avoid using synchronized API of 
SessionCatalog in withClient flow, this leads to DeadLock

No new revisions were added by this update.

Summary of changes:
 .../src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
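
To make the reported cycle concrete: a minimal sketch of the hazard, with 
plain monitors standing in for the `HiveSessionCatalog` and 
`HiveExternalCatalog` locks (all names are illustrative, not Spark code). The 
old `Shim_v3_0.loadPartition` effectively nested the session-catalog lock 
inside the external-catalog lock while other session work nested them the 
other way round; after the fix, that path never takes the session lock, so no 
cycle can form.

```scala
object LockOrderSketch extends App {
  val sessionLock  = new Object // stands in for the SessionCatalog monitor
  val externalLock = new Object // stands in for the HiveExternalCatalog monitor

  // Deadlock-prone shape: two threads nest the same two locks in
  // opposite orders. Defined for contrast only; not invoked below.
  def loadPartitionOld(): Unit = externalLock.synchronized {
    sessionLock.synchronized { /* getCurrentDatabase */ }
  }
  def otherSessionWork(): Unit = sessionLock.synchronized {
    externalLock.synchronized { /* external catalog call */ }
  }

  // Shape after the fix: Hive's table-name-only overload means the
  // database name is not needed, so the session lock is never taken here.
  def loadPartitionNew(): Unit = externalLock.synchronized {
    /* load the partition without touching sessionLock */
  }

  loadPartitionNew()
  println("loadPartition completed without nesting the session lock")
}
```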



[spark] branch master updated (de44e9c -> 05fcf26)

2020-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from de44e9c  [SPARK-32785][SQL] Interval with dangling parts should not 
results null
 add 05fcf26  [SPARK-32677][SQL] Load function resource before create

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/test_catalog.py |  5 +++--
 .../spark/sql/catalyst/catalog/SessionCatalog.scala  | 20 ++--
 .../spark/sql/execution/command/functions.scala  | 13 +
 .../test/resources/sql-tests/results/udaf.sql.out|  7 +--
 .../resources/sql-tests/results/udf/udf-udaf.sql.out |  7 +--
 .../spark/sql/execution/command/DDLSuite.scala   | 15 +++
 .../spark/sql/hive/HiveUDFDynamicLoadSuite.scala |  4 +++-
 7 files changed, 54 insertions(+), 17 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org


