[spark] branch branch-2.4 updated: [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 5b51880  [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn`

5b51880 is described below

commit 5b51880d88e639896a7ade08137b2e8f71203003
Author: Dongjoon Hyun
AuthorDate: Tue May 12 14:24:56 2020 -0700

    [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn`

    ### What changes were proposed in this pull request?

    This PR adds the `i` option to `grep` so that it ignores additional `build/mvn` output that is irrelevant to the version string.

    ### Why are the changes needed?

    SPARK-28963 added an additional output message, `Falling back to archive.apache.org to download Maven`, to `build/mvn`. This breaks `dev/create-release/release-build.sh`, and the Spark Packaging Jenkins job is currently hitting this issue consistently and is broken.
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/2912/console

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    This happens only when the mirror fails, so this was verified manually by hijacking the script. It works like the following.

    ```
    $ echo 'Falling back to archive.apache.org to download Maven' > out
    $ build/mvn help:evaluate -Dexpression=project.version >> out
    Using `mvn` from path: /Users/dongjoon/PRS/SPARK_RELEASE_2/build/apache-maven-3.6.3/bin/mvn
    $ cat out | grep -v INFO | grep -v WARNING | grep -v Download
    Falling back to archive.apache.org to download Maven
    3.1.0-SNAPSHOT
    $ cat out | grep -v INFO | grep -v WARNING | grep -vi Download
    3.1.0-SNAPSHOT
    ```

    Closes #28514 from dongjoon-hyun/SPARK_RELEASE_2.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit 3772154442e6341ea97a2f41cd672413de918e85)
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit ce52f61f720783e8eeb3313c763493054599091a)
    Signed-off-by: Dongjoon Hyun
---
 dev/create-release/release-build.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh
index 99a5928..1fd8a30 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -103,7 +103,7 @@ if [ -z "$SPARK_VERSION" ]; then
   # Run $MVN in a separate command so that 'set -e' does the right thing.
   TMP=$(mktemp)
   $MVN help:evaluate -Dexpression=project.version > $TMP
-  SPARK_VERSION=$(cat $TMP | grep -v INFO | grep -v WARNING | grep -v Download)
+  SPARK_VERSION=$(cat $TMP | grep -v INFO | grep -v WARNING | grep -vi Download)
   rm $TMP
 fi

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
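The effect of the one-character fix can be sketched with a small self-contained test. The sample lines below mimic the `build/mvn` output described in the commit message; nothing here is the actual release script, only a demonstration of why the case-insensitive `-i` flag matters:

```shell
#!/bin/sh
# Simulated build/mvn output: a fallback notice (lowercase "download"),
# a Maven INFO line, and the version string we actually want.
out=$(printf '%s\n' \
  'Falling back to archive.apache.org to download Maven' \
  '[INFO] Scanning for projects...' \
  '3.1.0-SNAPSHOT')

# Old filter: 'grep -v Download' is case-sensitive, so the fallback line
# (which says "download", lowercase) slips through.
old=$(printf '%s\n' "$out" | grep -v INFO | grep -v WARNING | grep -v Download)

# New filter: '-i' makes the Download pattern case-insensitive,
# so only the version string survives.
new=$(printf '%s\n' "$out" | grep -v INFO | grep -v WARNING | grep -vi Download)

echo "old filter kept:"
echo "$old"
echo "new filter kept:"
echo "$new"
```

With the old pipeline, `SPARK_VERSION` would have contained two lines (the fallback notice plus the version); with `-vi` it contains only the version.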
[spark] branch master updated (0feb3cb -> 3772154)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 0feb3cb  [SPARK-31687][INFRA] Use GitHub instead of GitBox in release script
 add 3772154  [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn`

No new revisions were added by this update.

Summary of changes:
 dev/create-release/release-build.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch branch-3.0 updated: [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new ce52f61  [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn`

ce52f61 is described below

commit ce52f61f720783e8eeb3313c763493054599091a
Author: Dongjoon Hyun
AuthorDate: Tue May 12 14:24:56 2020 -0700

    [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn`

    ### What changes were proposed in this pull request?

    This PR adds the `i` option to `grep` so that it ignores additional `build/mvn` output that is irrelevant to the version string.

    ### Why are the changes needed?

    SPARK-28963 added an additional output message, `Falling back to archive.apache.org to download Maven`, to `build/mvn`. This breaks `dev/create-release/release-build.sh`, and the Spark Packaging Jenkins job is currently hitting this issue consistently and is broken.
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/2912/console

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    This happens only when the mirror fails, so this was verified manually by hijacking the script. It works like the following.

    ```
    $ echo 'Falling back to archive.apache.org to download Maven' > out
    $ build/mvn help:evaluate -Dexpression=project.version >> out
    Using `mvn` from path: /Users/dongjoon/PRS/SPARK_RELEASE_2/build/apache-maven-3.6.3/bin/mvn
    $ cat out | grep -v INFO | grep -v WARNING | grep -v Download
    Falling back to archive.apache.org to download Maven
    3.1.0-SNAPSHOT
    $ cat out | grep -v INFO | grep -v WARNING | grep -vi Download
    3.1.0-SNAPSHOT
    ```

    Closes #28514 from dongjoon-hyun/SPARK_RELEASE_2.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit 3772154442e6341ea97a2f41cd672413de918e85)
    Signed-off-by: Dongjoon Hyun
---
 dev/create-release/release-build.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh
index 022d3af..655b079 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -103,7 +103,7 @@ if [ -z "$SPARK_VERSION" ]; then
   # Run $MVN in a separate command so that 'set -e' does the right thing.
   TMP=$(mktemp)
   $MVN help:evaluate -Dexpression=project.version > $TMP
-  SPARK_VERSION=$(cat $TMP | grep -v INFO | grep -v WARNING | grep -v Download)
+  SPARK_VERSION=$(cat $TMP | grep -v INFO | grep -v WARNING | grep -vi Download)
   rm $TMP
 fi
[spark] branch master updated: [SPARK-31687][INFRA] Use GitHub instead of GitBox in release script
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0feb3cb  [SPARK-31687][INFRA] Use GitHub instead of GitBox in release script

0feb3cb is described below

commit 0feb3cbe7759b7903efccba5eb35cafdc08a027e
Author: Dongjoon Hyun
AuthorDate: Tue May 12 13:07:00 2020 -0700

    [SPARK-31687][INFRA] Use GitHub instead of GitBox in release script

    ### What changes were proposed in this pull request?

    This PR aims to use GitHub URLs instead of GitBox URLs in the release scripts.

    ### Why are the changes needed?

    Currently, the Spark Packaging jobs are broken due to a GitBox issue.
    ```
    fatal: unable to access 'https://gitbox.apache.org/repos/asf/spark.git/': Failed to connect to gitbox.apache.org port 443: Connection timed out
    ```
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/2906/console
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-3.0-maven-snapshots/105/console
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.4-maven-snapshots/439/console

    ### Does this PR introduce _any_ user-facing change?

    No. (This is a dev-only script.)

    ### How was this patch tested?

    Manual.
    ```
    $ cat ./test.sh
    . dev/create-release/release-util.sh
    get_release_info
    git clone "$ASF_REPO"

    $ sh test.sh
    Branch [branch-3.0]:
    Current branch version is 3.0.1-SNAPSHOT.
    Release [3.0.0]:
    RC # [2]:
    Full name [Dongjoon Hyun]:
    GPG key [dongjoonapache.org]:
    Release details:
    BRANCH:     branch-3.0
    VERSION:    3.0.0
    TAG:        v3.0.0-rc2
    NEXT:       3.0.1-SNAPSHOT
    ASF USER:   dongjoon
    GPG KEY:    dongjoonapache.org
    FULL NAME:  Dongjoon Hyun
    E-MAIL:     dongjoonapache.org
    Is this info correct [y/n]? y
    ASF password:
    GPG passphrase:
    Cloning into 'spark'...
    remote: Enumerating objects: 223, done.
    remote: Counting objects: 100% (223/223), done.
    remote: Compressing objects: 100% (117/117), done.
    remote: Total 708324 (delta 70), reused 138 (delta 32), pack-reused 708101
    Receiving objects: 100% (708324/708324), 322.08 MiB | 2.94 MiB/s, done.
    Resolving deltas: 100% (289268/289268), done.
    Updating files: 100% (16287/16287), done.

    $ sh test.sh
    ...
    ```

    Closes #28513 from dongjoon-hyun/SPARK-31687.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 dev/create-release/release-util.sh | 9 +++++----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/dev/create-release/release-util.sh b/dev/create-release/release-util.sh
index 8ee94a6..af9ed20 100755
--- a/dev/create-release/release-util.sh
+++ b/dev/create-release/release-util.sh
@@ -19,9 +19,8 @@
 DRY_RUN=${DRY_RUN:-0}
 GPG="gpg --no-tty --batch"
-ASF_REPO="https://gitbox.apache.org/repos/asf/spark.git"
-ASF_REPO_WEBUI="https://gitbox.apache.org/repos/asf?p=spark.git"
-ASF_GITHUB_REPO="https://github.com/apache/spark"
+ASF_REPO="https://github.com/apache/spark"
+ASF_REPO_WEBUI="https://raw.githubusercontent.com/apache/spark"
 
 function error {
   echo "$*"
@@ -74,7 +73,7 @@ function fcreate_secure {
 }
 
 function check_for_tag {
-  curl -s --head --fail "$ASF_GITHUB_REPO/releases/tag/$1" > /dev/null
+  curl -s --head --fail "$ASF_REPO/releases/tag/$1" > /dev/null
 }
 
 function get_release_info {
@@ -91,7 +90,7 @@ function get_release_info {
   export GIT_BRANCH=$(read_config "Branch" "$GIT_BRANCH")
 
   # Find the current version for the branch.
-  local VERSION=$(curl -s "$ASF_REPO_WEBUI;a=blob_plain;f=pom.xml;hb=refs/heads/$GIT_BRANCH" |
+  local VERSION=$(curl -s "$ASF_REPO_WEBUI/$GIT_BRANCH/pom.xml" |
     parse_version)
   echo "Current branch version is $VERSION."
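The new version lookup above fetches the branch's `pom.xml` straight from GitHub's raw content host and parses the project version out of it. The sketch below approximates that flow; `parse_version_approx` is a hypothetical stand-in for the `parse_version` helper in `dev/create-release/release-util.sh` (whose exact implementation is not shown in this email), and the network call is kept in a comment so the demonstration runs offline:

```shell
#!/bin/sh
# Assumed stand-in for release-util.sh's parse_version: extract the first
# <version> element from a pom.xml read on stdin.
parse_version_approx() {
  sed -n 's|.*<version>\(.*\)</version>.*|\1|p' | head -1
}

# The real lookup after this commit (illustration only, requires network):
#   ASF_REPO_WEBUI="https://raw.githubusercontent.com/apache/spark"
#   VERSION=$(curl -s "$ASF_REPO_WEBUI/branch-3.0/pom.xml" | parse_version_approx)

# Offline demonstration against a minimal pom.xml fragment:
sample='<project>
  <modelVersion>4.0.0</modelVersion>
  <version>3.0.1-SNAPSHOT</version>
</project>'
VERSION=$(printf '%s\n' "$sample" | parse_version_approx)
echo "Current branch version is $VERSION."
```

This also shows why the URL change works: `raw.githubusercontent.com/<org>/<repo>/<branch>/<path>` serves the file contents directly, replacing the gitweb-style `;a=blob_plain;f=pom.xml;hb=refs/heads/...` query against GitBox.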
[spark] branch branch-3.0 updated: [SPARK-31683][CORE] Make Prometheus output consistent with DropWizard 4.1 result
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new e892a01  [SPARK-31683][CORE] Make Prometheus output consistent with DropWizard 4.1 result

e892a01 is described below

commit e892a016699d996b959b4db01242cff934d62f76
Author: Dongjoon Hyun
AuthorDate: Tue May 12 19:57:48 2020 +

    [SPARK-31683][CORE] Make Prometheus output consistent with DropWizard 4.1 result

    ### What changes were proposed in this pull request?

    This PR aims to update the Prometheus-related output format to be consistent with the DropWizard 4.1 result.
    - Add `Number` metrics for gauge metrics.
    - Add `type` labels.

    ### Why are the changes needed?

    SPARK-29032 added Prometheus support. After that, SPARK-29674 upgraded DropWizard for JDK9+ support, and this caused a difference in the output labels and in the number of keys for Gauge metrics. The current status is different from Apache Spark 2.4.5. Since we cannot change DropWizard, this PR aims to be consistent in Apache Spark 3.0.0 only.

    **DropWizard 3.x**
    ```
    metrics_master_aliveWorkers_Value 1.0
    ```

    **DropWizard 4.1**
    ```
    metrics_master_aliveWorkers_Value{type="gauges",} 1.0
    metrics_master_aliveWorkers_Number{type="gauges",} 1.0
    ```

    ### Does this PR introduce _any_ user-facing change?

    Yes, but this is a new feature in 3.0.0.

    ### How was this patch tested?

    Manually check the output like the following.

    **JMXExporter Result**
    ```
    $ curl -s http://localhost:8088/ | grep "^metrics_master" | sort
    metrics_master_aliveWorkers_Number{type="gauges",} 1.0
    metrics_master_aliveWorkers_Value{type="gauges",} 1.0
    metrics_master_apps_Number{type="gauges",} 0.0
    metrics_master_apps_Value{type="gauges",} 0.0
    metrics_master_waitingApps_Number{type="gauges",} 0.0
    metrics_master_waitingApps_Value{type="gauges",} 0.0
    metrics_master_workers_Number{type="gauges",} 1.0
    metrics_master_workers_Value{type="gauges",} 1.0
    ```

    **This PR**
    ```
    $ curl -s http://localhost:8080/metrics/master/prometheus/ | grep master
    metrics_master_aliveWorkers_Number{type="gauges"} 1
    metrics_master_aliveWorkers_Value{type="gauges"} 1
    metrics_master_apps_Number{type="gauges"} 0
    metrics_master_apps_Value{type="gauges"} 0
    metrics_master_waitingApps_Number{type="gauges"} 0
    metrics_master_waitingApps_Value{type="gauges"} 0
    metrics_master_workers_Number{type="gauges"} 1
    metrics_master_workers_Value{type="gauges"} 1
    ```

    Closes #28510 from dongjoon-hyun/SPARK-31683.

    Authored-by: Dongjoon Hyun
    Signed-off-by: DB Tsai
    (cherry picked from commit 07209f3e2deab824f04484fa6b8bab0ec0a635d6)
    Signed-off-by: DB Tsai
---
 .../spark/metrics/sink/PrometheusServlet.scala     | 73 --
 .../spark/status/api/v1/PrometheusResource.scala   | 52 +++
 2 files changed, 67 insertions(+), 58 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala b/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala
index 011c7bc..59b863b 100644
--- a/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala
+++ b/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala
@@ -56,58 +56,65 @@ private[spark] class PrometheusServlet(
   def getMetricsSnapshot(request: HttpServletRequest): String = {
     import scala.collection.JavaConverters._
 
+    val guagesLabel = """{type="gauges"}"""
+    val countersLabel = """{type="counters"}"""
+    val metersLabel = countersLabel
+    val histogramslabels = """{type="histograms"}"""
+    val timersLabels = """{type="timers"}"""
+
     val sb = new StringBuilder()
     registry.getGauges.asScala.foreach { case (k, v) =>
       if (!v.getValue.isInstanceOf[String]) {
-        sb.append(s"${normalizeKey(k)}Value ${v.getValue}\n")
+        sb.append(s"${normalizeKey(k)}Number$guagesLabel ${v.getValue}\n")
+        sb.append(s"${normalizeKey(k)}Value$guagesLabel ${v.getValue}\n")
       }
     }
     registry.getCounters.asScala.foreach { case (k, v) =>
-      sb.append(s"${normalizeKey(k)}Count ${v.getCount}\n")
+      sb.append(s"${normalizeKey(k)}Count$countersLabel ${v.getCount}\n")
     }
     registry.getHistograms.asScala.foreach { case (k, h) =>
       val snapshot = h.getSnapshot
       val prefix = normalizeKey(k)
-      sb.append(s"${prefix}Count ${h.getCount}\n")
-      sb.append(s"${prefix}Max ${snapshot.getMax}\n")
-      sb.append(s"${prefix}Mean ${snapshot.getMean}\n")
-      sb.append(s"${prefix}Min ${snapshot.getMin}\n")
-      sb.append(s"${prefix}50thPercentile ${snapshot.getMedian}\n")
-      sb.append(s"${prefix}75thPercentile ${snapshot.get75thPer
[spark] branch master updated (6994c64 -> 07209f3e)
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 6994c64  [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener
 add 07209f3e [SPARK-31683][CORE] Make Prometheus output consistent with DropWizard 4.1 result

No new revisions were added by this update.

Summary of changes:
 .../spark/metrics/sink/PrometheusServlet.scala     | 73 --
 .../spark/status/api/v1/PrometheusResource.scala   | 52 +++
 2 files changed, 67 insertions(+), 58 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 512cb2f  [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener

512cb2f is described below

commit 512cb2f0246a0d020f0ba726b4596555b15797c6
Author: Ali Smesseim
AuthorDate: Tue May 12 09:14:34 2020 -0700

    [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener

    ### What changes were proposed in this pull request?

    The update methods in HiveThriftServer2Listener now check whether the operation/session ID parameter actually exists in the `sessionList` and `executionList`, respectively. This prevents NullPointerExceptions when the operation or session ID is unknown; instead, a warning is written to the log.

    Also, in HiveSessionImpl.close(), we catch any exception thrown by `operationManager.closeOperation`. If for any reason this throws an exception, the remaining operations are still closed.

    ### Why are the changes needed?

    The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this interferes with the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException. In HiveSessionImpl.close(), if an exception is thrown when closing an operation, all following operations are not closed.

    ### Does this PR introduce any user-facing change?

    No

    ### How was this patch tested?

    Unit tests

    Closes #28155 from alismess-db/hive-thriftserver-listener-update-safer.

    Authored-by: Ali Smesseim
    Signed-off-by: gatorsmile
    (cherry picked from commit 6994c64efd5770a8fd33220cbcaddc1d96fed886)
    Signed-off-by: gatorsmile
---
 .../ui/HiveThriftServer2Listener.scala             | 120 -
 .../hive/thriftserver/HiveSessionImplSuite.scala   |  73 +
 .../ui/HiveThriftServer2ListenerSuite.scala        |  16 +++
 .../hive/service/cli/session/HiveSessionImpl.java  |   6 +-
 .../hive/service/cli/session/HiveSessionImpl.java  |   6 +-
 5 files changed, 170 insertions(+), 51 deletions(-)

diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala
index 6d0a506..20a8f2c 100644
--- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala
+++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala
@@ -25,6 +25,7 @@ import scala.collection.mutable.ArrayBuffer
 import org.apache.hive.service.server.HiveServer2
 
 import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.internal.Logging
 import org.apache.spark.internal.config.Status.LIVE_ENTITY_UPDATE_PERIOD
 import org.apache.spark.scheduler._
 import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.ExecutionState
@@ -38,7 +39,7 @@ private[thriftserver] class HiveThriftServer2Listener(
     kvstore: ElementTrackingStore,
     sparkConf: SparkConf,
     server: Option[HiveServer2],
-    live: Boolean = true) extends SparkListener {
+    live: Boolean = true) extends SparkListener with Logging {
 
   private val sessionList = new ConcurrentHashMap[String, LiveSessionData]()
   private val executionList = new ConcurrentHashMap[String, LiveExecutionData]()
@@ -131,60 +132,81 @@ private[thriftserver] class HiveThriftServer2Listener(
     updateLiveStore(session)
   }
 
-  private def onSessionClosed(e: SparkListenerThriftServerSessionClosed): Unit = {
-    val session = sessionList.get(e.sessionId)
-    session.finishTimestamp = e.finishTime
-    updateStoreWithTriggerEnabled(session)
-    sessionList.remove(e.sessionId)
-  }
+  private def onSessionClosed(e: SparkListenerThriftServerSessionClosed): Unit =
+    Option(sessionList.get(e.sessionId)) match {
+      case None => logWarning(s"onSessionClosed called with unknown session id: ${e.sessionId}")
+      case Some(sessionData) =>
+        val session = sessionData
+        session.finishTimestamp = e.finishTime
+        updateStoreWithTriggerEnabled(session)
+        sessionList.remove(e.sessionId)
+    }
 
-  private def onOperationStart(e: SparkListenerThriftServerOperationStart): Unit = {
-    val info = getOrCreateExecution(
-      e.id,
-      e.statement,
-      e.sessionId,
-      e.startTime,
-      e.userName)
-
-    info.state = ExecutionState.STARTED
-    executionList.put(e.id, info)
-    sessionList.get(e.sessionId).totalExecution += 1
-    executionList.get(e.id).groupId = e.groupId
-    updateLiv
[spark] branch master updated (e248bc7 -> 6994c64)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from e248bc7  [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF
 add 6994c64  [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener

No new revisions were added by this update.

Summary of changes:
 .../ui/HiveThriftServer2Listener.scala             | 120 -
 .../hive/thriftserver/HiveSessionImplSuite.scala   |  73 +
 .../ui/HiveThriftServer2ListenerSuite.scala        |  16 +++
 .../hive/service/cli/session/HiveSessionImpl.java  |   6 +-
 .../hive/service/cli/session/HiveSessionImpl.java  |   6 +-
 5 files changed, 170 insertions(+), 51 deletions(-)
 create mode 100644 sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveSessionImplSuite.scala
[spark] branch branch-3.0 updated: [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 512cb2f [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener 512cb2f is described below commit 512cb2f0246a0d020f0ba726b4596555b15797c6 Author: Ali Smesseim AuthorDate: Tue May 12 09:14:34 2020 -0700 [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener ### What changes were proposed in this pull request? The update methods in HiveThriftServer2Listener now check if the parameter operation/session ID actually exist in the `sessionList` and `executionList` respectively. This prevents NullPointerExceptions if the operation or session ID is unknown. Instead, a warning is written to the log. Also, in HiveSessionImpl.close(), we catch any exception thrown by `operationManager.closeOperation`. If for any reason this throws an exception, other operations are not prevented from being closed. ### Why are the changes needed? The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this hampers with the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException. In HiveSessionImpl.close(), if an exception is thrown when closing an operation, all following operations are not closed. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests Closes #28155 from alismess-db/hive-thriftserver-listener-update-safer. 
Authored-by: Ali Smesseim Signed-off-by: gatorsmile (cherry picked from commit 6994c64efd5770a8fd33220cbcaddc1d96fed886) Signed-off-by: gatorsmile --- .../ui/HiveThriftServer2Listener.scala | 120 - .../hive/thriftserver/HiveSessionImplSuite.scala | 73 + .../ui/HiveThriftServer2ListenerSuite.scala| 16 +++ .../hive/service/cli/session/HiveSessionImpl.java | 6 +- .../hive/service/cli/session/HiveSessionImpl.java | 6 +- 5 files changed, 170 insertions(+), 51 deletions(-) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala index 6d0a506..20a8f2c 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala @@ -25,6 +25,7 @@ import scala.collection.mutable.ArrayBuffer import org.apache.hive.service.server.HiveServer2 import org.apache.spark.{SparkConf, SparkContext} +import org.apache.spark.internal.Logging import org.apache.spark.internal.config.Status.LIVE_ENTITY_UPDATE_PERIOD import org.apache.spark.scheduler._ import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.ExecutionState @@ -38,7 +39,7 @@ private[thriftserver] class HiveThriftServer2Listener( kvstore: ElementTrackingStore, sparkConf: SparkConf, server: Option[HiveServer2], -live: Boolean = true) extends SparkListener { +live: Boolean = true) extends SparkListener with Logging { private val sessionList = new ConcurrentHashMap[String, LiveSessionData]() private val executionList = new ConcurrentHashMap[String, LiveExecutionData]() @@ -131,60 +132,81 @@ private[thriftserver] class HiveThriftServer2Listener( updateLiveStore(session) } - private def onSessionClosed(e: SparkListenerThriftServerSessionClosed): Unit = { -val session = 
sessionList.get(e.sessionId) -session.finishTimestamp = e.finishTime -updateStoreWithTriggerEnabled(session) -sessionList.remove(e.sessionId) - } + private def onSessionClosed(e: SparkListenerThriftServerSessionClosed): Unit = +Option(sessionList.get(e.sessionId)) match { + case None => logWarning(s"onSessionClosed called with unknown session id: ${e.sessionId}") + case Some(sessionData) => +val session = sessionData +session.finishTimestamp = e.finishTime +updateStoreWithTriggerEnabled(session) +sessionList.remove(e.sessionId) +} - private def onOperationStart(e: SparkListenerThriftServerOperationStart): Unit = { -val info = getOrCreateExecution( - e.id, - e.statement, - e.sessionId, - e.startTime, - e.userName) - -info.state = ExecutionState.STARTED -executionList.put(e.id, info) -sessionList.get(e.sessionId).totalExecution += 1 -executionList.get(e.id).groupId = e.groupId -updateLiv
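The defensive pattern the patch applies can be sketched outside Spark. Below is a minimal, hypothetical Java model (the `SessionStore` class and its method names are illustrative, not Spark's API): looking the session up first and logging a warning turns an update for an unknown ID into a no-op instead of a NullPointerException.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical minimal model of the listener's null-safe update pattern.
// An unknown session id produces a logged warning and a no-op instead of
// a NullPointerException; names here are illustrative, not Spark's API.
class SessionStore {
    static final class SessionData {
        long finishTimestamp;
    }

    private final ConcurrentHashMap<String, SessionData> sessionList = new ConcurrentHashMap<>();

    void onSessionCreated(String id) {
        sessionList.put(id, new SessionData());
    }

    /** Returns true if the session was known and has been closed. */
    boolean onSessionClosed(String id, long finishTime) {
        SessionData session = sessionList.get(id);
        if (session == null) {
            // Mirrors the patch's logWarning branch.
            System.err.println("onSessionClosed called with unknown session id: " + id);
            return false;
        }
        session.finishTimestamp = finishTime;
        sessionList.remove(id);
        return true;
    }
}
```

The key point is checking the lookup result before dereferencing it, so a stray or duplicate close event degrades to a warning rather than an exception on the listener bus.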
[spark] branch master updated (e248bc7 -> 6994c64)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from e248bc7 [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF add 6994c64 [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener No new revisions were added by this update. Summary of changes: .../ui/HiveThriftServer2Listener.scala | 120 - .../hive/thriftserver/HiveSessionImplSuite.scala | 73 + .../ui/HiveThriftServer2ListenerSuite.scala| 16 +++ .../hive/service/cli/session/HiveSessionImpl.java | 6 +- .../hive/service/cli/session/HiveSessionImpl.java | 6 +- 5 files changed, 170 insertions(+), 51 deletions(-) create mode 100644 sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveSessionImplSuite.scala - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF
This is an automated email from the ASF dual-hosted git repository. meng pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e248bc7 [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF e248bc7 is described below commit e248bc7af6086cde7dd89a51459ae6a221a600c8 Author: Weichen Xu AuthorDate: Tue May 12 08:54:28 2020 -0700 [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF ### What changes were proposed in this pull request? Expose hashFunc property in HashingTF Some third-party library such as mleap need to access it. See background description here: https://github.com/combust/mleap/pull/665#issuecomment-621258623 ### Why are the changes needed? See https://github.com/combust/mleap/pull/665#issuecomment-621258623 ### Does this PR introduce any user-facing change? No. Only add a package private constructor. ### How was this patch tested? N/A Closes #28413 from WeichenXu123/hashing_tf_expose_hashfunc. Authored-by: Weichen Xu Signed-off-by: Xiangrui Meng --- .../org/apache/spark/ml/feature/HashingTF.scala| 40 +- .../apache/spark/ml/feature/HashingTFSuite.scala | 4 +++ 2 files changed, 35 insertions(+), 9 deletions(-) diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala index 80bf859..d2bb013 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala @@ -42,14 +42,17 @@ import org.apache.spark.util.VersionUtils.majorMinorVersion * otherwise the features will not be mapped evenly to the columns. 
*/ @Since("1.2.0") -class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) +class HashingTF @Since("3.0.0") private[ml] ( +@Since("1.4.0") override val uid: String, +@Since("3.1.0") val hashFuncVersion: Int) extends Transformer with HasInputCol with HasOutputCol with HasNumFeatures with DefaultParamsWritable { - private var hashFunc: Any => Int = FeatureHasher.murmur3Hash - @Since("1.2.0") - def this() = this(Identifiable.randomUID("hashingTF")) + def this() = this(Identifiable.randomUID("hashingTF"), HashingTF.SPARK_3_MURMUR3_HASH) + + @Since("1.4.0") + def this(uid: String) = this(uid, hashFuncVersion = HashingTF.SPARK_3_MURMUR3_HASH) /** @group setParam */ @Since("1.4.0") @@ -122,7 +125,12 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) */ @Since("3.0.0") def indexOf(term: Any): Int = { -Utils.nonNegativeMod(hashFunc(term), $(numFeatures)) +val hashValue = hashFuncVersion match { + case HashingTF.SPARK_2_MURMUR3_HASH => OldHashingTF.murmur3Hash(term) + case HashingTF.SPARK_3_MURMUR3_HASH => FeatureHasher.murmur3Hash(term) + case _ => throw new IllegalArgumentException("Illegal hash function version setting.") +} +Utils.nonNegativeMod(hashValue, $(numFeatures)) } @Since("1.4.1") @@ -132,27 +140,41 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def toString: String = { s"HashingTF: uid=$uid, binary=${$(binary)}, numFeatures=${$(numFeatures)}" } + + @Since("3.0.0") + override def save(path: String): Unit = { +require(hashFuncVersion == HashingTF.SPARK_3_MURMUR3_HASH, + "Cannot save model which is loaded from lower version spark saved model. 
We can address " + + "it by (1) use old spark version to save the model, or (2) use new version spark to " + + "re-train the pipeline.") +super.save(path) + } } @Since("1.6.0") object HashingTF extends DefaultParamsReadable[HashingTF] { + private[ml] val SPARK_2_MURMUR3_HASH = 1 + private[ml] val SPARK_3_MURMUR3_HASH = 2 + private class HashingTFReader extends MLReader[HashingTF] { private val className = classOf[HashingTF].getName override def load(path: String): HashingTF = { val metadata = DefaultParamsReader.loadMetadata(path, sc, className) - val hashingTF = new HashingTF(metadata.uid) - metadata.getAndSetParams(hashingTF) // We support loading old `HashingTF` saved by previous Spark versions. // Previous `HashingTF` uses `mllib.feature.HashingTF.murmur3Hash`, but new `HashingTF` uses // `ml.Feature.FeatureHasher.murmur3Hash`. val (majorVersion, _) = majorMinorVersion(metadata.sparkVersion) - if (majorVersion < 3) { -hashingTF.hashFunc = OldHashingTF.murmur3Hash + val hashFuncVersion = if (majorVersion < 3) { +SPARK_2_MURMUR3_HASH + } else { +SPARK_3_MURMUR3_HASH } + val hashingTF = new Hashing
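The core of this fix is replacing the mutable `hashFunc` field with a persisted integer version tag that selects the hash function at call time, so models saved by Spark 2 keep Spark 2's hash after loading. A self-contained Java sketch of that dispatch (the hash stand-ins and the `HashDispatch` class are hypothetical; Spark uses its own Murmur3 implementations and `Utils.nonNegativeMod`):

```java
// Illustrative sketch, not Spark's API: the hash function is chosen by an
// integer version tag that travels with the saved model, so old models keep
// the Spark 2 hash while new ones use the Spark 3 hash.
final class HashDispatch {
    static final int SPARK_2_MURMUR3_HASH = 1;
    static final int SPARK_3_MURMUR3_HASH = 2;

    // Stand-ins for mllib.feature.HashingTF.murmur3Hash and
    // ml.feature.FeatureHasher.murmur3Hash (hypothetical implementations).
    static int spark2Hash(Object term) { return term.hashCode(); }
    static int spark3Hash(Object term) { return term.hashCode() * 31; }

    static int indexOf(Object term, int hashFuncVersion, int numFeatures) {
        int hashValue;
        switch (hashFuncVersion) {
            case SPARK_2_MURMUR3_HASH: hashValue = spark2Hash(term); break;
            case SPARK_3_MURMUR3_HASH: hashValue = spark3Hash(term); break;
            default: throw new IllegalArgumentException("Illegal hash function version setting.");
        }
        // Non-negative modulo, analogous to Spark's Utils.nonNegativeMod.
        int mod = hashValue % numFeatures;
        return mod < 0 ? mod + numFeatures : mod;
    }
}
```

Dispatching on a serialized version tag rather than a function reference is what makes the behavior stable across save/load between Spark versions.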
[spark] branch branch-3.0 updated: [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF
This is an automated email from the ASF dual-hosted git repository. meng pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new b50d53b [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF b50d53b is described below commit b50d53b1079ea32c75f9192f27b2b07cdec69641 Author: Weichen Xu AuthorDate: Tue May 12 08:54:28 2020 -0700 [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF ### What changes were proposed in this pull request? Expose hashFunc property in HashingTF Some third-party library such as mleap need to access it. See background description here: https://github.com/combust/mleap/pull/665#issuecomment-621258623 ### Why are the changes needed? See https://github.com/combust/mleap/pull/665#issuecomment-621258623 ### Does this PR introduce any user-facing change? No. Only add a package private constructor. ### How was this patch tested? N/A Closes #28413 from WeichenXu123/hashing_tf_expose_hashfunc. Authored-by: Weichen Xu Signed-off-by: Xiangrui Meng (cherry picked from commit e248bc7af6086cde7dd89a51459ae6a221a600c8) Signed-off-by: Xiangrui Meng --- .../org/apache/spark/ml/feature/HashingTF.scala| 40 +- .../apache/spark/ml/feature/HashingTFSuite.scala | 4 +++ 2 files changed, 35 insertions(+), 9 deletions(-) diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala index 80bf859..d2bb013 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala @@ -42,14 +42,17 @@ import org.apache.spark.util.VersionUtils.majorMinorVersion * otherwise the features will not be mapped evenly to the columns. 
*/ @Since("1.2.0") -class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) +class HashingTF @Since("3.0.0") private[ml] ( +@Since("1.4.0") override val uid: String, +@Since("3.1.0") val hashFuncVersion: Int) extends Transformer with HasInputCol with HasOutputCol with HasNumFeatures with DefaultParamsWritable { - private var hashFunc: Any => Int = FeatureHasher.murmur3Hash - @Since("1.2.0") - def this() = this(Identifiable.randomUID("hashingTF")) + def this() = this(Identifiable.randomUID("hashingTF"), HashingTF.SPARK_3_MURMUR3_HASH) + + @Since("1.4.0") + def this(uid: String) = this(uid, hashFuncVersion = HashingTF.SPARK_3_MURMUR3_HASH) /** @group setParam */ @Since("1.4.0") @@ -122,7 +125,12 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) */ @Since("3.0.0") def indexOf(term: Any): Int = { -Utils.nonNegativeMod(hashFunc(term), $(numFeatures)) +val hashValue = hashFuncVersion match { + case HashingTF.SPARK_2_MURMUR3_HASH => OldHashingTF.murmur3Hash(term) + case HashingTF.SPARK_3_MURMUR3_HASH => FeatureHasher.murmur3Hash(term) + case _ => throw new IllegalArgumentException("Illegal hash function version setting.") +} +Utils.nonNegativeMod(hashValue, $(numFeatures)) } @Since("1.4.1") @@ -132,27 +140,41 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def toString: String = { s"HashingTF: uid=$uid, binary=${$(binary)}, numFeatures=${$(numFeatures)}" } + + @Since("3.0.0") + override def save(path: String): Unit = { +require(hashFuncVersion == HashingTF.SPARK_3_MURMUR3_HASH, + "Cannot save model which is loaded from lower version spark saved model. 
We can address " + + "it by (1) use old spark version to save the model, or (2) use new version spark to " + + "re-train the pipeline.") +super.save(path) + } } @Since("1.6.0") object HashingTF extends DefaultParamsReadable[HashingTF] { + private[ml] val SPARK_2_MURMUR3_HASH = 1 + private[ml] val SPARK_3_MURMUR3_HASH = 2 + private class HashingTFReader extends MLReader[HashingTF] { private val className = classOf[HashingTF].getName override def load(path: String): HashingTF = { val metadata = DefaultParamsReader.loadMetadata(path, sc, className) - val hashingTF = new HashingTF(metadata.uid) - metadata.getAndSetParams(hashingTF) // We support loading old `HashingTF` saved by previous Spark versions. // Previous `HashingTF` uses `mllib.feature.HashingTF.murmur3Hash`, but new `HashingTF` uses // `ml.Feature.FeatureHasher.murmur3Hash`. val (majorVersion, _) = majorMinorVersion(metadata.sparkVersion) - if (majorVersion < 3) { -hashingTF.hashFunc = OldHashingTF.murmur3Hash + val hashFuncVersion = if (majorVersion < 3) { +
[spark] branch branch-3.0 updated: [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new cbe75bb [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator cbe75bb is described below commit cbe75bb8879ec408088eaf6944284a893bb63c92 Author: Max Gekk AuthorDate: Tue May 12 14:05:31 2020 + [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator ### What changes were proposed in this pull request? Generates java.time.Instant/java.time.LocalDate for DateType/TimestampType by `RandomDataGenerator.forType` when the SQL config `spark.sql.datetime.java8API.enabled` is set to `true`. ### Why are the changes needed? To improve test coverage, and check java.time.Instant/java.time.LocalDate types in round trip tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running modified test suites `RowEncoderSuite`, `RandomDataGeneratorSuite` and `HadoopFsRelationTest`. Closes #28502 from MaxGekk/random-java8-datetime. 
Authored-by: Max Gekk Signed-off-by: Wenchen Fan (cherry picked from commit a3fafddf390fd180047a0b9ef46f052a9b6813e0) Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/RandomDataGenerator.scala | 105 + .../spark/sql/RandomDataGeneratorSuite.scala | 32 --- .../sql/catalyst/encoders/RowEncoderSuite.scala| 36 +++ .../spark/sql/sources/HadoopFsRelationTest.scala | 75 --- 4 files changed, 146 insertions(+), 102 deletions(-) diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala index cf8d772..6a5bdc4 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala @@ -19,13 +19,15 @@ package org.apache.spark.sql import java.math.MathContext import java.sql.{Date, Timestamp} +import java.time.{Instant, LocalDate, LocalDateTime, ZoneId} import scala.collection.mutable import scala.util.{Random, Try} import org.apache.spark.sql.catalyst.CatalystTypeConverters -import org.apache.spark.sql.catalyst.util.DateTimeConstants.MILLIS_PER_DAY +import org.apache.spark.sql.catalyst.util.DateTimeConstants.{MICROS_PER_MILLIS, MILLIS_PER_DAY} import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.CalendarInterval /** @@ -162,7 +164,7 @@ object RandomDataGenerator { }) case BooleanType => Some(() => rand.nextBoolean()) case DateType => -def uniformDateRand(rand: Random): java.sql.Date = { +def uniformDaysRand(rand: Random): Int = { var milliseconds = rand.nextLong() % 25340232959L // -6213574080L is the number of milliseconds before January 1, 1970, 00:00:00 GMT // for "0001-01-01 00:00:00.00". We need to find a @@ -172,27 +174,37 @@ object RandomDataGenerator { // January 1, 1970, 00:00:00 GMT for "-12-31 23:59:59.99". 
milliseconds = rand.nextLong() % 25340232959L } - val date = DateTimeUtils.toJavaDate((milliseconds / MILLIS_PER_DAY).toInt) - // The generated `date` is based on the hybrid calendar Julian + Gregorian since - // 1582-10-15 but it should be valid in Proleptic Gregorian calendar too which is used - // by Spark SQL since version 3.0 (see SPARK-26651). We try to convert `date` to - // a local date in Proleptic Gregorian calendar to satisfy this requirement. - // Some years are leap years in Julian calendar but not in Proleptic Gregorian calendar. - // As the consequence of that, 29 February of such years might not exist in Proleptic - // Gregorian calendar. When this happens, we shift the date by one day. - Try { date.toLocalDate; date }.getOrElse(new Date(date.getTime + MILLIS_PER_DAY)) + (milliseconds / MILLIS_PER_DAY).toInt +} +val specialDates = Seq( + "0001-01-01", // the fist day of Common Era + "1582-10-15", // the cutover date from Julian to Gregorian calendar + "1970-01-01", // the epoch date + "-12-31" // the last supported date according to SQL standard +) +if (SQLConf.get.getConf(SQLConf.DATETIME_JAVA8API_ENABLED)) { + randomNumeric[LocalDate]( +rand, +(rand: Random) => LocalDate.ofEpochDay(uniformDaysRand(rand)), +specialDates.map(LocalDate.parse)) +} else { + randomNumeric[java.sql
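The generator strategy in the patch can be sketched in a few lines of Java (a sketch under stated assumptions, not Spark's `RandomDataGenerator`: the class name, the exact sampling range, and the choice of `9999-12-31` as the SQL-standard maximum date are illustrative here): draw a uniform epoch-day offset, and always mix in a fixed set of calendar boundary dates that round-trip tests should cover.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch (not Spark's RandomDataGenerator API): uniform random
// dates plus a fixed set of calendar boundary "special dates".
final class RandomDates {
    // First day of the Common Era, the Julian-to-Gregorian cutover date,
    // the epoch date, and the SQL-standard maximum date.
    static final String[] SPECIAL_DATES = {
        "0001-01-01", "1582-10-15", "1970-01-01", "9999-12-31"
    };

    static LocalDate uniformDate(Random rand) {
        long minDay = LocalDate.parse("0001-01-01").toEpochDay();
        long maxDay = LocalDate.parse("9999-12-31").toEpochDay();
        long span = maxDay - minDay + 1;
        // floorMod keeps the offset non-negative even for negative longs.
        long day = minDay + Math.floorMod(rand.nextLong(), span);
        return LocalDate.ofEpochDay(day);
    }

    static List<LocalDate> specialDates() {
        List<LocalDate> dates = new ArrayList<>();
        for (String d : SPECIAL_DATES) {
            dates.add(LocalDate.parse(d));
        }
        return dates;
    }
}
```

Returning `java.time.LocalDate` directly is the point of the patch: when `spark.sql.datetime.java8API.enabled` is on, tests exercise the Java 8 types instead of `java.sql.Date`.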
[spark] branch master updated: [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a3fafdd [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator a3fafdd is described below commit a3fafddf390fd180047a0b9ef46f052a9b6813e0 Author: Max Gekk AuthorDate: Tue May 12 14:05:31 2020 + [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator ### What changes were proposed in this pull request? Generates java.time.Instant/java.time.LocalDate for DateType/TimestampType by `RandomDataGenerator.forType` when the SQL config `spark.sql.datetime.java8API.enabled` is set to `true`. ### Why are the changes needed? To improve test coverage, and check java.time.Instant/java.time.LocalDate types in round trip tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running modified test suites `RowEncoderSuite`, `RandomDataGeneratorSuite` and `HadoopFsRelationTest`. Closes #28502 from MaxGekk/random-java8-datetime. 
Authored-by: Max Gekk Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/RandomDataGenerator.scala | 105 + .../spark/sql/RandomDataGeneratorSuite.scala | 32 --- .../sql/catalyst/encoders/RowEncoderSuite.scala| 36 +++ .../spark/sql/sources/HadoopFsRelationTest.scala | 75 --- 4 files changed, 146 insertions(+), 102 deletions(-) diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala index cf8d772..6a5bdc4 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala @@ -19,13 +19,15 @@ package org.apache.spark.sql import java.math.MathContext import java.sql.{Date, Timestamp} +import java.time.{Instant, LocalDate, LocalDateTime, ZoneId} import scala.collection.mutable import scala.util.{Random, Try} import org.apache.spark.sql.catalyst.CatalystTypeConverters -import org.apache.spark.sql.catalyst.util.DateTimeConstants.MILLIS_PER_DAY +import org.apache.spark.sql.catalyst.util.DateTimeConstants.{MICROS_PER_MILLIS, MILLIS_PER_DAY} import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.CalendarInterval /** @@ -162,7 +164,7 @@ object RandomDataGenerator { }) case BooleanType => Some(() => rand.nextBoolean()) case DateType => -def uniformDateRand(rand: Random): java.sql.Date = { +def uniformDaysRand(rand: Random): Int = { var milliseconds = rand.nextLong() % 25340232959L // -6213574080L is the number of milliseconds before January 1, 1970, 00:00:00 GMT // for "0001-01-01 00:00:00.00". We need to find a @@ -172,27 +174,37 @@ object RandomDataGenerator { // January 1, 1970, 00:00:00 GMT for "-12-31 23:59:59.99". 
milliseconds = rand.nextLong() % 25340232959L } - val date = DateTimeUtils.toJavaDate((milliseconds / MILLIS_PER_DAY).toInt) - // The generated `date` is based on the hybrid calendar Julian + Gregorian since - // 1582-10-15 but it should be valid in Proleptic Gregorian calendar too which is used - // by Spark SQL since version 3.0 (see SPARK-26651). We try to convert `date` to - // a local date in Proleptic Gregorian calendar to satisfy this requirement. - // Some years are leap years in Julian calendar but not in Proleptic Gregorian calendar. - // As the consequence of that, 29 February of such years might not exist in Proleptic - // Gregorian calendar. When this happens, we shift the date by one day. - Try { date.toLocalDate; date }.getOrElse(new Date(date.getTime + MILLIS_PER_DAY)) + (milliseconds / MILLIS_PER_DAY).toInt +} +val specialDates = Seq( + "0001-01-01", // the fist day of Common Era + "1582-10-15", // the cutover date from Julian to Gregorian calendar + "1970-01-01", // the epoch date + "-12-31" // the last supported date according to SQL standard +) +if (SQLConf.get.getConf(SQLConf.DATETIME_JAVA8API_ENABLED)) { + randomNumeric[LocalDate]( +rand, +(rand: Random) => LocalDate.ofEpochDay(uniformDaysRand(rand)), +specialDates.map(LocalDate.parse)) +} else { + randomNumeric[java.sql.Date]( +rand, +(rand: Random) => { + val date = DateTimeUtils.toJavaDate(un
[spark] branch master updated (ce714d8 -> 178ca96)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from ce714d8 [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs add 178ca96 [SPARK-31102][SQL] Spark-sql fails to parse when contains comment No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/catalyst/parser/SqlBase.g4 | 2 +- .../spark/sql/catalyst/parser/PlanParserSuite.scala | 7 ++- .../sql/hive/thriftserver/SparkSQLCLIDriver.scala| 12 ++-- .../spark/sql/hive/thriftserver/CliSuite.scala | 20 +++- 4 files changed, 24 insertions(+), 17 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ce714d8 [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs ce714d8 is described below commit ce714d81894a48e2d06c530674c2190e0483e1b4 Author: Kent Yao AuthorDate: Tue May 12 13:37:13 2020 + [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs ### What changes were proposed in this pull request? When I was finding the root cause for SPARK-31675, I noticed that it was very difficult to see what was actually going on, since the CLI output nothing but ```sql Error in query: java.lang.IllegalArgumentException: Wrong FS: blablah/.hive-staging_blahbla, expected: hdfs://cluster1 ``` It is really hard to find the cause from such a terse error message without a certain amount of experience. In this PR, I propose to print the full stack trace when an AnalysisException with underlying root causes occurs; this can be suppressed via the `-S` option. ### Why are the changes needed? In SPARK-11188, >For analysis exceptions in the sql-shell, we should only print the error message to the screen. The stacktrace will never have useful information since this error is used to signify an error with the query. But nowadays, some `AnalysisException`s do carry useful debugging information, e.g. the `AnalysisException` below may wrap exceptions from the Hive or Hadoop side.
https://github.com/apache/spark/blob/a28ed86a387b286745b30cd4d90b3d558205a5a7/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L97-L112 ```scala at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649) at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:468) at org.apache.hadoop.hive.common.FileUtils.isSubDir(FileUtils.java:626) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2850) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1398) at org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1593) ``` ### Does this PR introduce _any_ user-facing change? Yes, `bin/spark-sql` will print all the stack trace when an AnalysisException which contains root causes occurs, before this fix, only the message will be printed. before ```scala Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs:..., expected: hdfs://hz-cluster10; ``` After ```scala Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs:..., expected: hdfs://hz-cluster10; org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Wrong FS: ..., expected: hdfs://hz-cluster10; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:109) at org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:890) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:312) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:101) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106) at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) at org.apache.spark.sql.Dataset.(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.spark.sq
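The behavioral change can be modeled in a few lines of plain Java (a sketch, not Spark's actual SparkSQLCLIDriver code; the class, method name, and silent-flag handling are assumptions): the message is always printed, and the stack trace is appended only when the exception carries an underlying root cause and the CLI was not started in silent (`-S`) mode.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Sketch of the reporting rule introduced by the patch (illustrative names,
// not Spark's actual CLI code): always show the message, and append the full
// stack trace only when there is an underlying cause and silent mode is off.
final class ErrorReporter {
    static String format(Throwable e, boolean silent) {
        StringBuilder sb = new StringBuilder("Error in query: ").append(e.getMessage());
        if (!silent && e.getCause() != null) {
            StringWriter sw = new StringWriter();
            e.printStackTrace(new PrintWriter(sw, true));
            sb.append(System.lineSeparator()).append(sw);
        }
        return sb.toString();
    }
}
```

This keeps the pre-SPARK-11188 terseness for plain query errors (no cause, or `-S` set) while surfacing the Hive/Hadoop root cause when one exists.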
[spark] branch branch-3.0 updated: [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 2549e38 [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs 2549e38 is described below commit 2549e38690fd461c7d01518f1c5df2452efa66b5 Author: Kent Yao AuthorDate: Tue May 12 13:37:13 2020 + [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs ### What changes were proposed in this pull request? When I was finding the root cause for SPARK-31675, I noticed that it was very difficult to see what was actually going on, since the CLI output nothing but ```sql Error in query: java.lang.IllegalArgumentException: Wrong FS: blablah/.hive-staging_blahbla, expected: hdfs://cluster1 ``` It is really hard to find the cause from such a terse error message without a certain amount of experience. In this PR, I propose to print the full stack trace when an AnalysisException with underlying root causes occurs; this can be suppressed via the `-S` option. ### Why are the changes needed? In SPARK-11188, >For analysis exceptions in the sql-shell, we should only print the error message to the screen. The stacktrace will never have useful information since this error is used to signify an error with the query. But nowadays, some `AnalysisException`s do carry useful debugging information, e.g. the `AnalysisException` below may wrap exceptions from the Hive or Hadoop side.
https://github.com/apache/spark/blob/a28ed86a387b286745b30cd4d90b3d558205a5a7/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L97-L112 ```scala at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649) at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:468) at org.apache.hadoop.hive.common.FileUtils.isSubDir(FileUtils.java:626) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2850) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1398) at org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1593) ``` ### Does this PR introduce _any_ user-facing change? Yes, `bin/spark-sql` will print all the stack trace when an AnalysisException which contains root causes occurs, before this fix, only the message will be printed. before ```scala Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs:..., expected: hdfs://hz-cluster10; ``` After ```scala Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs:..., expected: hdfs://hz-cluster10; org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Wrong FS: ..., expected: hdfs://hz-cluster10; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:109) at org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:890) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:312) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:101) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106) at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) at org.apache.spark.sql.Dataset.(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.
[spark] branch master updated: [MINOR][DOCS] Mention lack of RDD order preservation after deserialization
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 59d9099 [MINOR][DOCS] Mention lack of RDD order preservation after deserialization 59d9099 is described below commit 59d90997a52f78450fefbc96beba1d731b3678a1 Author: Antonin Delpeuch AuthorDate: Tue May 12 08:27:43 2020 -0500 [MINOR][DOCS] Mention lack of RDD order preservation after deserialization ### What changes were proposed in this pull request? This changes the docs to make it clearer that order preservation is not guaranteed when saving an RDD to disk and reading it back ([SPARK-5300](https://issues.apache.org/jira/browse/SPARK-5300)). I added two sentences about this in the RDD Programming Guide. The issue was discussed on the dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html ### Why are the changes needed? Because RDDs are order-aware collections, it is natural to expect that if I use `saveAsTextFile` and then load the resulting file with `sparkContext.textFile`, I obtain an RDD in the same order. This is unfortunately not the case at the moment, and there is no agreed-upon way to fix this in Spark itself (see PR #4204 which attempted to fix this). Users should be aware of this. ### Does this PR introduce _any_ user-facing change? Yes, two new sentences in the documentation. ### How was this patch tested? By checking that the documentation looks good. Closes #28465 from wetneb/SPARK-5300-docs. 
Authored-by: Antonin Delpeuch Signed-off-by: Sean Owen --- docs/rdd-programming-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/rdd-programming-guide.md b/docs/rdd-programming-guide.md index ba99007..70bfefc 100644 --- a/docs/rdd-programming-guide.md +++ b/docs/rdd-programming-guide.md @@ -360,7 +360,7 @@ Some notes on reading files with Spark: * If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system. -* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`. +* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`. When multiple files are read, the order of the partitions depends on the order the files are returned from the filesystem. It may or may not, for example, follow the lexicographic ordering of the files by path. Within a partiti [...] * The `textFile` method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
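The caveat added to the guide can be illustrated without a Spark cluster: `saveAsTextFile` writes one `part-NNNNN` file per partition, and a subsequent `textFile` read visits those files in whatever order the filesystem lists them. Below is a minimal pure-Python sketch of that mechanism (the `save_partitions`/`load_partitions` helpers are hypothetical stand-ins, not Spark APIs); only an explicitly sorted listing guarantees the write order is recovered.

```python
import os
import tempfile

# Stand-in for saveAsTextFile: each partition of an RDD becomes a
# separate part-NNNNN file inside the output directory.
def save_partitions(outdir, partitions):
    for i, part in enumerate(partitions):
        with open(os.path.join(outdir, "part-%05d" % i), "w") as f:
            f.write("\n".join(part) + "\n")

# Stand-in for textFile: the read order depends on the order the
# filesystem returns the files (os.listdir here), which is NOT
# guaranteed to be lexicographic. Pass ordered=True to force one.
def load_partitions(outdir, ordered=False):
    names = os.listdir(outdir)
    if ordered:
        names = sorted(names)  # deterministic, lexicographic order
    lines = []
    for name in names:
        with open(os.path.join(outdir, name)) as f:
            lines.extend(f.read().splitlines())
    return lines

outdir = tempfile.mkdtemp()
original = [["a", "b"], ["c", "d"], ["e", "f"]]
save_partitions(outdir, original)

# Only the explicitly sorted read is guaranteed to match the write order;
# an unsorted read may interleave partitions differently.
ordered_read = load_partitions(outdir, ordered=True)
print(ordered_read)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

In real Spark code the analogous defensive move is to impose an explicit ordering after reading back (e.g. with `sortBy`) rather than relying on file-listing order.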