[spark] branch branch-2.4 updated: [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 5b51880  [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn`

5b51880 is described below

commit 5b51880d88e639896a7ade08137b2e8f71203003
Author: Dongjoon Hyun
AuthorDate: Tue May 12 14:24:56 2020 -0700

    [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn`

    ### What changes were proposed in this pull request?

    This PR adds the `i` option to `grep` so that it ignores additional `build/mvn` output that is irrelevant to the version string.

    ### Why are the changes needed?

    SPARK-28963 added an additional output message, `Falling back to archive.apache.org to download Maven`, to `build/mvn`. This breaks `dev/create-release/release-build.sh`, and the Spark Packaging Jenkins job is currently hitting this issue consistently and is broken.
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/2912/console

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    This happens only when the mirror fails, so this was verified manually by hijacking the script. It works like the following.

    ```
    $ echo 'Falling back to archive.apache.org to download Maven' > out
    $ build/mvn help:evaluate -Dexpression=project.version >> out
    Using `mvn` from path: /Users/dongjoon/PRS/SPARK_RELEASE_2/build/apache-maven-3.6.3/bin/mvn
    $ cat out | grep -v INFO | grep -v WARNING | grep -v Download
    Falling back to archive.apache.org to download Maven
    3.1.0-SNAPSHOT
    $ cat out | grep -v INFO | grep -v WARNING | grep -vi Download
    3.1.0-SNAPSHOT
    ```

    Closes #28514 from dongjoon-hyun/SPARK_RELEASE_2.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit 3772154442e6341ea97a2f41cd672413de918e85)
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit ce52f61f720783e8eeb3313c763493054599091a)
    Signed-off-by: Dongjoon Hyun
---
 dev/create-release/release-build.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh
index 99a5928..1fd8a30 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -103,7 +103,7 @@ if [ -z "$SPARK_VERSION" ]; then
   # Run $MVN in a separate command so that 'set -e' does the right thing.
   TMP=$(mktemp)
   $MVN help:evaluate -Dexpression=project.version > $TMP
-  SPARK_VERSION=$(cat $TMP | grep -v INFO | grep -v WARNING | grep -v Download)
+  SPARK_VERSION=$(cat $TMP | grep -v INFO | grep -v WARNING | grep -vi Download)
   rm $TMP
 fi

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
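The effect of the one-character fix can be sketched with a small self-contained test. The sample lines below mimic the `build/mvn` output described in the commit message; nothing here is the actual release script, only a demonstration of why the case-insensitive `-i` flag matters:

```shell
#!/bin/sh
# Simulated build/mvn output: a fallback notice (lowercase "download"),
# a Maven INFO line, and the version string we actually want.
out=$(printf '%s\n' \
  'Falling back to archive.apache.org to download Maven' \
  '[INFO] Scanning for projects...' \
  '3.1.0-SNAPSHOT')

# Old filter: 'grep -v Download' is case-sensitive, so the fallback line
# (which says "download", lowercase) slips through.
old=$(printf '%s\n' "$out" | grep -v INFO | grep -v WARNING | grep -v Download)

# New filter: '-i' makes the Download pattern case-insensitive,
# so only the version string survives.
new=$(printf '%s\n' "$out" | grep -v INFO | grep -v WARNING | grep -vi Download)

echo "old filter kept:"
echo "$old"
echo "new filter kept:"
echo "$new"
```

With the old pipeline, `SPARK_VERSION` would have contained two lines (the fallback notice plus the version); with `-vi` it contains only the version.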
[spark] branch master updated (0feb3cb -> 3772154)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 0feb3cb  [SPARK-31687][INFRA] Use GitHub instead of GitBox in release script
 add 3772154  [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn`

No new revisions were added by this update.

Summary of changes:
 dev/create-release/release-build.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch branch-3.0 updated: [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new ce52f61  [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn`

ce52f61 is described below

commit ce52f61f720783e8eeb3313c763493054599091a
Author: Dongjoon Hyun
AuthorDate: Tue May 12 14:24:56 2020 -0700

    [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn`

    ### What changes were proposed in this pull request?

    This PR adds the `i` option to `grep` so that it ignores additional `build/mvn` output that is irrelevant to the version string.

    ### Why are the changes needed?

    SPARK-28963 added an additional output message, `Falling back to archive.apache.org to download Maven`, to `build/mvn`. This breaks `dev/create-release/release-build.sh`, and the Spark Packaging Jenkins job is currently hitting this issue consistently and is broken.
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/2912/console

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    This happens only when the mirror fails, so this was verified manually by hijacking the script. It works like the following.

    ```
    $ echo 'Falling back to archive.apache.org to download Maven' > out
    $ build/mvn help:evaluate -Dexpression=project.version >> out
    Using `mvn` from path: /Users/dongjoon/PRS/SPARK_RELEASE_2/build/apache-maven-3.6.3/bin/mvn
    $ cat out | grep -v INFO | grep -v WARNING | grep -v Download
    Falling back to archive.apache.org to download Maven
    3.1.0-SNAPSHOT
    $ cat out | grep -v INFO | grep -v WARNING | grep -vi Download
    3.1.0-SNAPSHOT
    ```

    Closes #28514 from dongjoon-hyun/SPARK_RELEASE_2.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit 3772154442e6341ea97a2f41cd672413de918e85)
    Signed-off-by: Dongjoon Hyun
---
 dev/create-release/release-build.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh
index 022d3af..655b079 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -103,7 +103,7 @@ if [ -z "$SPARK_VERSION" ]; then
   # Run $MVN in a separate command so that 'set -e' does the right thing.
   TMP=$(mktemp)
   $MVN help:evaluate -Dexpression=project.version > $TMP
-  SPARK_VERSION=$(cat $TMP | grep -v INFO | grep -v WARNING | grep -v Download)
+  SPARK_VERSION=$(cat $TMP | grep -v INFO | grep -v WARNING | grep -vi Download)
   rm $TMP
 fi
[spark] branch master updated: [SPARK-31687][INFRA] Use GitHub instead of GitBox in release script
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0feb3cb  [SPARK-31687][INFRA] Use GitHub instead of GitBox in release script

0feb3cb is described below

commit 0feb3cbe7759b7903efccba5eb35cafdc08a027e
Author: Dongjoon Hyun
AuthorDate: Tue May 12 13:07:00 2020 -0700

    [SPARK-31687][INFRA] Use GitHub instead of GitBox in release script

    ### What changes were proposed in this pull request?

    This PR aims to use GitHub URLs instead of GitBox URLs in the release scripts.

    ### Why are the changes needed?

    Currently, the Spark Packaging jobs are broken due to a GitBox issue.
    ```
    fatal: unable to access 'https://gitbox.apache.org/repos/asf/spark.git/': Failed to connect to gitbox.apache.org port 443: Connection timed out
    ```
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/2906/console
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-3.0-maven-snapshots/105/console
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.4-maven-snapshots/439/console

    ### Does this PR introduce _any_ user-facing change?

    No. (This is a dev-only script.)

    ### How was this patch tested?

    Manual.
    ```
    $ cat ./test.sh
    . dev/create-release/release-util.sh
    get_release_info
    git clone "$ASF_REPO"

    $ sh test.sh
    Branch [branch-3.0]:
    Current branch version is 3.0.1-SNAPSHOT.
    Release [3.0.0]:
    RC # [2]:
    Full name [Dongjoon Hyun]:
    GPG key [dongjoonapache.org]:
    Release details:
    BRANCH:     branch-3.0
    VERSION:    3.0.0
    TAG:        v3.0.0-rc2
    NEXT:       3.0.1-SNAPSHOT
    ASF USER:   dongjoon
    GPG KEY:    dongjoonapache.org
    FULL NAME:  Dongjoon Hyun
    E-MAIL:     dongjoonapache.org
    Is this info correct [y/n]? y
    ASF password:
    GPG passphrase:
    Cloning into 'spark'...
    remote: Enumerating objects: 223, done.
    remote: Counting objects: 100% (223/223), done.
    remote: Compressing objects: 100% (117/117), done.
    remote: Total 708324 (delta 70), reused 138 (delta 32), pack-reused 708101
    Receiving objects: 100% (708324/708324), 322.08 MiB | 2.94 MiB/s, done.
    Resolving deltas: 100% (289268/289268), done.
    Updating files: 100% (16287/16287), done.

    $ sh test.sh
    ...
    ```

    Closes #28513 from dongjoon-hyun/SPARK-31687.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 dev/create-release/release-util.sh | 9 +++++----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/dev/create-release/release-util.sh b/dev/create-release/release-util.sh
index 8ee94a6..af9ed20 100755
--- a/dev/create-release/release-util.sh
+++ b/dev/create-release/release-util.sh
@@ -19,9 +19,8 @@
 DRY_RUN=${DRY_RUN:-0}
 GPG="gpg --no-tty --batch"
-ASF_REPO="https://gitbox.apache.org/repos/asf/spark.git"
-ASF_REPO_WEBUI="https://gitbox.apache.org/repos/asf?p=spark.git"
-ASF_GITHUB_REPO="https://github.com/apache/spark"
+ASF_REPO="https://github.com/apache/spark"
+ASF_REPO_WEBUI="https://raw.githubusercontent.com/apache/spark"
 
 function error {
   echo "$*"
@@ -74,7 +73,7 @@ function fcreate_secure {
 }
 
 function check_for_tag {
-  curl -s --head --fail "$ASF_GITHUB_REPO/releases/tag/$1" > /dev/null
+  curl -s --head --fail "$ASF_REPO/releases/tag/$1" > /dev/null
 }
 
 function get_release_info {
@@ -91,7 +90,7 @@ function get_release_info {
   export GIT_BRANCH=$(read_config "Branch" "$GIT_BRANCH")
 
   # Find the current version for the branch.
-  local VERSION=$(curl -s "$ASF_REPO_WEBUI;a=blob_plain;f=pom.xml;hb=refs/heads/$GIT_BRANCH" |
+  local VERSION=$(curl -s "$ASF_REPO_WEBUI/$GIT_BRANCH/pom.xml" |
     parse_version)
   echo "Current branch version is $VERSION."
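The new version lookup above fetches the branch's `pom.xml` straight from GitHub's raw content host and parses the project version out of it. The sketch below approximates that flow; `parse_version_approx` is a hypothetical stand-in for the `parse_version` helper in `dev/create-release/release-util.sh` (whose exact implementation is not shown in this email), and the network call is kept in a comment so the demonstration runs offline:

```shell
#!/bin/sh
# Assumed stand-in for release-util.sh's parse_version: extract the first
# <version> element from a pom.xml read on stdin.
parse_version_approx() {
  sed -n 's|.*<version>\(.*\)</version>.*|\1|p' | head -1
}

# The real lookup after this commit (illustration only, requires network):
#   ASF_REPO_WEBUI="https://raw.githubusercontent.com/apache/spark"
#   VERSION=$(curl -s "$ASF_REPO_WEBUI/branch-3.0/pom.xml" | parse_version_approx)

# Offline demonstration against a minimal pom.xml fragment:
sample='<project>
  <modelVersion>4.0.0</modelVersion>
  <version>3.0.1-SNAPSHOT</version>
</project>'
VERSION=$(printf '%s\n' "$sample" | parse_version_approx)
echo "Current branch version is $VERSION."
```

This also shows why the URL change works: `raw.githubusercontent.com/<org>/<repo>/<branch>/<path>` serves the file contents directly, replacing the gitweb-style `;a=blob_plain;f=pom.xml;hb=refs/heads/...` query against GitBox.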
[spark] branch branch-3.0 updated: [SPARK-31683][CORE] Make Prometheus output consistent with DropWizard 4.1 result
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new e892a01  [SPARK-31683][CORE] Make Prometheus output consistent with DropWizard 4.1 result

e892a01 is described below

commit e892a016699d996b959b4db01242cff934d62f76
Author: Dongjoon Hyun
AuthorDate: Tue May 12 19:57:48 2020 +

    [SPARK-31683][CORE] Make Prometheus output consistent with DropWizard 4.1 result

    ### What changes were proposed in this pull request?

    This PR aims to update the Prometheus-related output format to be consistent with the DropWizard 4.1 result.
    - Add `Number` metrics for gauge metrics.
    - Add `type` labels.

    ### Why are the changes needed?

    SPARK-29032 added Prometheus support. After that, SPARK-29674 upgraded DropWizard for JDK9+ support, and this caused a difference in the output labels and in the number of keys for Gauge metrics. The current status is different from Apache Spark 2.4.5. Since we cannot change DropWizard, this PR aims to be consistent in Apache Spark 3.0.0 only.

    **DropWizard 3.x**
    ```
    metrics_master_aliveWorkers_Value 1.0
    ```

    **DropWizard 4.1**
    ```
    metrics_master_aliveWorkers_Value{type="gauges",} 1.0
    metrics_master_aliveWorkers_Number{type="gauges",} 1.0
    ```

    ### Does this PR introduce _any_ user-facing change?

    Yes, but this is a new feature in 3.0.0.

    ### How was this patch tested?

    Manually check the output like the following.

    **JMXExporter Result**
    ```
    $ curl -s http://localhost:8088/ | grep "^metrics_master" | sort
    metrics_master_aliveWorkers_Number{type="gauges",} 1.0
    metrics_master_aliveWorkers_Value{type="gauges",} 1.0
    metrics_master_apps_Number{type="gauges",} 0.0
    metrics_master_apps_Value{type="gauges",} 0.0
    metrics_master_waitingApps_Number{type="gauges",} 0.0
    metrics_master_waitingApps_Value{type="gauges",} 0.0
    metrics_master_workers_Number{type="gauges",} 1.0
    metrics_master_workers_Value{type="gauges",} 1.0
    ```

    **This PR**
    ```
    $ curl -s http://localhost:8080/metrics/master/prometheus/ | grep master
    metrics_master_aliveWorkers_Number{type="gauges"} 1
    metrics_master_aliveWorkers_Value{type="gauges"} 1
    metrics_master_apps_Number{type="gauges"} 0
    metrics_master_apps_Value{type="gauges"} 0
    metrics_master_waitingApps_Number{type="gauges"} 0
    metrics_master_waitingApps_Value{type="gauges"} 0
    metrics_master_workers_Number{type="gauges"} 1
    metrics_master_workers_Value{type="gauges"} 1
    ```

    Closes #28510 from dongjoon-hyun/SPARK-31683.

    Authored-by: Dongjoon Hyun
    Signed-off-by: DB Tsai
    (cherry picked from commit 07209f3e2deab824f04484fa6b8bab0ec0a635d6)
    Signed-off-by: DB Tsai
---
 .../spark/metrics/sink/PrometheusServlet.scala     | 73 --
 .../spark/status/api/v1/PrometheusResource.scala   | 52 +++
 2 files changed, 67 insertions(+), 58 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala b/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala
index 011c7bc..59b863b 100644
--- a/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala
+++ b/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala
@@ -56,58 +56,65 @@ private[spark] class PrometheusServlet(
   def getMetricsSnapshot(request: HttpServletRequest): String = {
     import scala.collection.JavaConverters._
 
+    val guagesLabel = """{type="gauges"}"""
+    val countersLabel = """{type="counters"}"""
+    val metersLabel = countersLabel
+    val histogramslabels = """{type="histograms"}"""
+    val timersLabels = """{type="timers"}"""
+
     val sb = new StringBuilder()
     registry.getGauges.asScala.foreach { case (k, v) =>
       if (!v.getValue.isInstanceOf[String]) {
-        sb.append(s"${normalizeKey(k)}Value ${v.getValue}\n")
+        sb.append(s"${normalizeKey(k)}Number$guagesLabel ${v.getValue}\n")
+        sb.append(s"${normalizeKey(k)}Value$guagesLabel ${v.getValue}\n")
       }
     }
     registry.getCounters.asScala.foreach { case (k, v) =>
-      sb.append(s"${normalizeKey(k)}Count ${v.getCount}\n")
+      sb.append(s"${normalizeKey(k)}Count$countersLabel ${v.getCount}\n")
     }
     registry.getHistograms.asScala.foreach { case (k, h) =>
       val snapshot = h.getSnapshot
       val prefix = normalizeKey(k)
-      sb.append(s"${prefix}Count ${h.getCount}\n")
-      sb.append(s"${prefix}Max ${snapshot.getMax}\n")
-      sb.append(s"${prefix}Mean ${snapshot.getMean}\n")
-      sb.append(s"${prefix}Min ${snapshot.getMin}\n")
-      sb.append(s"${prefix}50thPercentile ${snapshot.getMedian}\n")
-      sb.append(s"${prefix}75thPercentile ${snapshot.get75thPer
[spark] branch master updated (6994c64 -> 07209f3e)
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 6994c64  [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener
 add 07209f3e [SPARK-31683][CORE] Make Prometheus output consistent with DropWizard 4.1 result

No new revisions were added by this update.

Summary of changes:
 .../spark/metrics/sink/PrometheusServlet.scala     | 73 --
 .../spark/status/api/v1/PrometheusResource.scala   | 52 +++
 2 files changed, 67 insertions(+), 58 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 512cb2f  [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener

512cb2f is described below

commit 512cb2f0246a0d020f0ba726b4596555b15797c6
Author: Ali Smesseim
AuthorDate: Tue May 12 09:14:34 2020 -0700

    [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener

    ### What changes were proposed in this pull request?

    The update methods in HiveThriftServer2Listener now check whether the operation/session ID parameter actually exists in the `sessionList` and `executionList`, respectively. This prevents NullPointerExceptions when the operation or session ID is unknown; instead, a warning is written to the log.

    Also, in HiveSessionImpl.close(), we catch any exception thrown by `operationManager.closeOperation`. If for any reason this throws an exception, the remaining operations are still closed.

    ### Why are the changes needed?

    The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this interferes with the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException. In HiveSessionImpl.close(), if an exception is thrown when closing an operation, all following operations are not closed.

    ### Does this PR introduce any user-facing change?

    No

    ### How was this patch tested?

    Unit tests

    Closes #28155 from alismess-db/hive-thriftserver-listener-update-safer.

    Authored-by: Ali Smesseim
    Signed-off-by: gatorsmile
    (cherry picked from commit 6994c64efd5770a8fd33220cbcaddc1d96fed886)
    Signed-off-by: gatorsmile
---
 .../ui/HiveThriftServer2Listener.scala             | 120 -
 .../hive/thriftserver/HiveSessionImplSuite.scala   |  73 +
 .../ui/HiveThriftServer2ListenerSuite.scala        |  16 +++
 .../hive/service/cli/session/HiveSessionImpl.java  |   6 +-
 .../hive/service/cli/session/HiveSessionImpl.java  |   6 +-
 5 files changed, 170 insertions(+), 51 deletions(-)

diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala
index 6d0a506..20a8f2c 100644
--- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala
+++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala
@@ -25,6 +25,7 @@ import scala.collection.mutable.ArrayBuffer
 import org.apache.hive.service.server.HiveServer2
 
 import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.internal.Logging
 import org.apache.spark.internal.config.Status.LIVE_ENTITY_UPDATE_PERIOD
 import org.apache.spark.scheduler._
 import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.ExecutionState
@@ -38,7 +39,7 @@ private[thriftserver] class HiveThriftServer2Listener(
     kvstore: ElementTrackingStore,
     sparkConf: SparkConf,
     server: Option[HiveServer2],
-    live: Boolean = true) extends SparkListener {
+    live: Boolean = true) extends SparkListener with Logging {
 
   private val sessionList = new ConcurrentHashMap[String, LiveSessionData]()
   private val executionList = new ConcurrentHashMap[String, LiveExecutionData]()
@@ -131,60 +132,81 @@ private[thriftserver] class HiveThriftServer2Listener(
     updateLiveStore(session)
   }
 
-  private def onSessionClosed(e: SparkListenerThriftServerSessionClosed): Unit = {
-    val session = sessionList.get(e.sessionId)
-    session.finishTimestamp = e.finishTime
-    updateStoreWithTriggerEnabled(session)
-    sessionList.remove(e.sessionId)
-  }
+  private def onSessionClosed(e: SparkListenerThriftServerSessionClosed): Unit =
+    Option(sessionList.get(e.sessionId)) match {
+      case None => logWarning(s"onSessionClosed called with unknown session id: ${e.sessionId}")
+      case Some(sessionData) =>
+        val session = sessionData
+        session.finishTimestamp = e.finishTime
+        updateStoreWithTriggerEnabled(session)
+        sessionList.remove(e.sessionId)
+    }
 
-  private def onOperationStart(e: SparkListenerThriftServerOperationStart): Unit = {
-    val info = getOrCreateExecution(
-      e.id,
-      e.statement,
-      e.sessionId,
-      e.startTime,
-      e.userName)
-
-    info.state = ExecutionState.STARTED
-    executionList.put(e.id, info)
-    sessionList.get(e.sessionId).totalExecution += 1
-    executionList.get(e.id).groupId = e.groupId
-    updateLiv
[spark] branch master updated (e248bc7 -> 6994c64)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from e248bc7  [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF
 add 6994c64  [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener

No new revisions were added by this update.

Summary of changes:
 .../ui/HiveThriftServer2Listener.scala             | 120 -
 .../hive/thriftserver/HiveSessionImplSuite.scala   |  73 +
 .../ui/HiveThriftServer2ListenerSuite.scala        |  16 +++
 .../hive/service/cli/session/HiveSessionImpl.java  |   6 +-
 .../hive/service/cli/session/HiveSessionImpl.java  |   6 +-
 5 files changed, 170 insertions(+), 51 deletions(-)
 create mode 100644 sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveSessionImplSuite.scala
[spark] branch branch-3.0 updated: [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 512cb2f [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener 512cb2f is described below commit 512cb2f0246a0d020f0ba726b4596555b15797c6 Author: Ali Smesseim AuthorDate: Tue May 12 09:14:34 2020 -0700 [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener ### What changes were proposed in this pull request? The update methods in HiveThriftServer2Listener now check if the parameter operation/session ID actually exist in the `sessionList` and `executionList` respectively. This prevents NullPointerExceptions if the operation or session ID is unknown. Instead, a warning is written to the log. Also, in HiveSessionImpl.close(), we catch any exception thrown by `operationManager.closeOperation`. If for any reason this throws an exception, other operations are not prevented from being closed. ### Why are the changes needed? The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this hampers with the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException. In HiveSessionImpl.close(), if an exception is thrown when closing an operation, all following operations are not closed. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests Closes #28155 from alismess-db/hive-thriftserver-listener-update-safer. 
Authored-by: Ali Smesseim Signed-off-by: gatorsmile (cherry picked from commit 6994c64efd5770a8fd33220cbcaddc1d96fed886) Signed-off-by: gatorsmile --- .../ui/HiveThriftServer2Listener.scala | 120 - .../hive/thriftserver/HiveSessionImplSuite.scala | 73 + .../ui/HiveThriftServer2ListenerSuite.scala| 16 +++ .../hive/service/cli/session/HiveSessionImpl.java | 6 +- .../hive/service/cli/session/HiveSessionImpl.java | 6 +- 5 files changed, 170 insertions(+), 51 deletions(-) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala index 6d0a506..20a8f2c 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala @@ -25,6 +25,7 @@ import scala.collection.mutable.ArrayBuffer import org.apache.hive.service.server.HiveServer2 import org.apache.spark.{SparkConf, SparkContext} +import org.apache.spark.internal.Logging import org.apache.spark.internal.config.Status.LIVE_ENTITY_UPDATE_PERIOD import org.apache.spark.scheduler._ import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.ExecutionState @@ -38,7 +39,7 @@ private[thriftserver] class HiveThriftServer2Listener( kvstore: ElementTrackingStore, sparkConf: SparkConf, server: Option[HiveServer2], -live: Boolean = true) extends SparkListener { +live: Boolean = true) extends SparkListener with Logging { private val sessionList = new ConcurrentHashMap[String, LiveSessionData]() private val executionList = new ConcurrentHashMap[String, LiveExecutionData]() @@ -131,60 +132,81 @@ private[thriftserver] class HiveThriftServer2Listener( updateLiveStore(session) } - private def onSessionClosed(e: SparkListenerThriftServerSessionClosed): Unit = { -val session = 
sessionList.get(e.sessionId) -session.finishTimestamp = e.finishTime -updateStoreWithTriggerEnabled(session) -sessionList.remove(e.sessionId) - } + private def onSessionClosed(e: SparkListenerThriftServerSessionClosed): Unit = +Option(sessionList.get(e.sessionId)) match { + case None => logWarning(s"onSessionClosed called with unknown session id: ${e.sessionId}") + case Some(sessionData) => +val session = sessionData +session.finishTimestamp = e.finishTime +updateStoreWithTriggerEnabled(session) +sessionList.remove(e.sessionId) +} - private def onOperationStart(e: SparkListenerThriftServerOperationStart): Unit = { -val info = getOrCreateExecution( - e.id, - e.statement, - e.sessionId, - e.startTime, - e.userName) - -info.state = ExecutionState.STARTED -executionList.put(e.id, info) -sessionList.get(e.sessionId).totalExecution += 1 -executionList.get(e.id).groupId = e.groupId -updateLiv
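The defensive pattern the patch applies can be sketched outside Spark. Below is a minimal, hypothetical Java model (the `SessionStore` class and its method names are illustrative, not Spark's API): looking the session up first and logging a warning turns an update for an unknown ID into a no-op instead of a NullPointerException.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical minimal model of the listener's null-safe update pattern.
// An unknown session id produces a logged warning and a no-op instead of
// a NullPointerException; names here are illustrative, not Spark's API.
class SessionStore {
    static final class SessionData {
        long finishTimestamp;
    }

    private final ConcurrentHashMap<String, SessionData> sessionList = new ConcurrentHashMap<>();

    void onSessionCreated(String id) {
        sessionList.put(id, new SessionData());
    }

    /** Returns true if the session was known and has been closed. */
    boolean onSessionClosed(String id, long finishTime) {
        SessionData session = sessionList.get(id);
        if (session == null) {
            // Mirrors the patch's logWarning branch.
            System.err.println("onSessionClosed called with unknown session id: " + id);
            return false;
        }
        session.finishTimestamp = finishTime;
        sessionList.remove(id);
        return true;
    }
}
```

The key point is checking the lookup result before dereferencing it, so a stray or duplicate close event degrades to a warning rather than an exception on the listener bus.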
[spark] branch master updated (e248bc7 -> 6994c64)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from e248bc7 [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF add 6994c64 [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener No new revisions were added by this update. Summary of changes: .../ui/HiveThriftServer2Listener.scala | 120 - .../hive/thriftserver/HiveSessionImplSuite.scala | 73 + .../ui/HiveThriftServer2ListenerSuite.scala| 16 +++ .../hive/service/cli/session/HiveSessionImpl.java | 6 +- .../hive/service/cli/session/HiveSessionImpl.java | 6 +- 5 files changed, 170 insertions(+), 51 deletions(-) create mode 100644 sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveSessionImplSuite.scala - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF
This is an automated email from the ASF dual-hosted git repository. meng pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e248bc7 [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF e248bc7 is described below commit e248bc7af6086cde7dd89a51459ae6a221a600c8 Author: Weichen Xu AuthorDate: Tue May 12 08:54:28 2020 -0700 [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF ### What changes were proposed in this pull request? Expose hashFunc property in HashingTF Some third-party library such as mleap need to access it. See background description here: https://github.com/combust/mleap/pull/665#issuecomment-621258623 ### Why are the changes needed? See https://github.com/combust/mleap/pull/665#issuecomment-621258623 ### Does this PR introduce any user-facing change? No. Only add a package private constructor. ### How was this patch tested? N/A Closes #28413 from WeichenXu123/hashing_tf_expose_hashfunc. Authored-by: Weichen Xu Signed-off-by: Xiangrui Meng --- .../org/apache/spark/ml/feature/HashingTF.scala| 40 +- .../apache/spark/ml/feature/HashingTFSuite.scala | 4 +++ 2 files changed, 35 insertions(+), 9 deletions(-) diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala index 80bf859..d2bb013 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala @@ -42,14 +42,17 @@ import org.apache.spark.util.VersionUtils.majorMinorVersion * otherwise the features will not be mapped evenly to the columns. 
*/ @Since("1.2.0") -class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) +class HashingTF @Since("3.0.0") private[ml] ( +@Since("1.4.0") override val uid: String, +@Since("3.1.0") val hashFuncVersion: Int) extends Transformer with HasInputCol with HasOutputCol with HasNumFeatures with DefaultParamsWritable { - private var hashFunc: Any => Int = FeatureHasher.murmur3Hash - @Since("1.2.0") - def this() = this(Identifiable.randomUID("hashingTF")) + def this() = this(Identifiable.randomUID("hashingTF"), HashingTF.SPARK_3_MURMUR3_HASH) + + @Since("1.4.0") + def this(uid: String) = this(uid, hashFuncVersion = HashingTF.SPARK_3_MURMUR3_HASH) /** @group setParam */ @Since("1.4.0") @@ -122,7 +125,12 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) */ @Since("3.0.0") def indexOf(term: Any): Int = { -Utils.nonNegativeMod(hashFunc(term), $(numFeatures)) +val hashValue = hashFuncVersion match { + case HashingTF.SPARK_2_MURMUR3_HASH => OldHashingTF.murmur3Hash(term) + case HashingTF.SPARK_3_MURMUR3_HASH => FeatureHasher.murmur3Hash(term) + case _ => throw new IllegalArgumentException("Illegal hash function version setting.") +} +Utils.nonNegativeMod(hashValue, $(numFeatures)) } @Since("1.4.1") @@ -132,27 +140,41 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def toString: String = { s"HashingTF: uid=$uid, binary=${$(binary)}, numFeatures=${$(numFeatures)}" } + + @Since("3.0.0") + override def save(path: String): Unit = { +require(hashFuncVersion == HashingTF.SPARK_3_MURMUR3_HASH, + "Cannot save model which is loaded from lower version spark saved model. 
We can address " + + "it by (1) use old spark version to save the model, or (2) use new version spark to " + + "re-train the pipeline.") +super.save(path) + } } @Since("1.6.0") object HashingTF extends DefaultParamsReadable[HashingTF] { + private[ml] val SPARK_2_MURMUR3_HASH = 1 + private[ml] val SPARK_3_MURMUR3_HASH = 2 + private class HashingTFReader extends MLReader[HashingTF] { private val className = classOf[HashingTF].getName override def load(path: String): HashingTF = { val metadata = DefaultParamsReader.loadMetadata(path, sc, className) - val hashingTF = new HashingTF(metadata.uid) - metadata.getAndSetParams(hashingTF) // We support loading old `HashingTF` saved by previous Spark versions. // Previous `HashingTF` uses `mllib.feature.HashingTF.murmur3Hash`, but new `HashingTF` uses // `ml.Feature.FeatureHasher.murmur3Hash`. val (majorVersion, _) = majorMinorVersion(metadata.sparkVersion) - if (majorVersion < 3) { -hashingTF.hashFunc = OldHashingTF.murmur3Hash + val hashFuncVersion = if (majorVersion < 3) { +SPARK_2_MURMUR3_HASH + } else { +SPARK_3_MURMUR3_HASH } + val hashingTF = new Hashing
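The core of this fix is replacing the mutable `hashFunc` field with a persisted integer version tag that selects the hash function at call time, so models saved by Spark 2 keep Spark 2's hash after loading. A self-contained Java sketch of that dispatch (the hash stand-ins and the `HashDispatch` class are hypothetical; Spark uses its own Murmur3 implementations and `Utils.nonNegativeMod`):

```java
// Illustrative sketch, not Spark's API: the hash function is chosen by an
// integer version tag that travels with the saved model, so old models keep
// the Spark 2 hash while new ones use the Spark 3 hash.
final class HashDispatch {
    static final int SPARK_2_MURMUR3_HASH = 1;
    static final int SPARK_3_MURMUR3_HASH = 2;

    // Stand-ins for mllib.feature.HashingTF.murmur3Hash and
    // ml.feature.FeatureHasher.murmur3Hash (hypothetical implementations).
    static int spark2Hash(Object term) { return term.hashCode(); }
    static int spark3Hash(Object term) { return term.hashCode() * 31; }

    static int indexOf(Object term, int hashFuncVersion, int numFeatures) {
        int hashValue;
        switch (hashFuncVersion) {
            case SPARK_2_MURMUR3_HASH: hashValue = spark2Hash(term); break;
            case SPARK_3_MURMUR3_HASH: hashValue = spark3Hash(term); break;
            default: throw new IllegalArgumentException("Illegal hash function version setting.");
        }
        // Non-negative modulo, analogous to Spark's Utils.nonNegativeMod.
        int mod = hashValue % numFeatures;
        return mod < 0 ? mod + numFeatures : mod;
    }
}
```

Dispatching on a serialized version tag rather than a function reference is what makes the behavior stable across save/load between Spark versions.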
[spark] branch branch-3.0 updated: [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF
This is an automated email from the ASF dual-hosted git repository. meng pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new b50d53b [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF b50d53b is described below commit b50d53b1079ea32c75f9192f27b2b07cdec69641 Author: Weichen Xu AuthorDate: Tue May 12 08:54:28 2020 -0700 [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF ### What changes were proposed in this pull request? Expose hashFunc property in HashingTF Some third-party library such as mleap need to access it. See background description here: https://github.com/combust/mleap/pull/665#issuecomment-621258623 ### Why are the changes needed? See https://github.com/combust/mleap/pull/665#issuecomment-621258623 ### Does this PR introduce any user-facing change? No. Only add a package private constructor. ### How was this patch tested? N/A Closes #28413 from WeichenXu123/hashing_tf_expose_hashfunc. Authored-by: Weichen Xu Signed-off-by: Xiangrui Meng (cherry picked from commit e248bc7af6086cde7dd89a51459ae6a221a600c8) Signed-off-by: Xiangrui Meng --- .../org/apache/spark/ml/feature/HashingTF.scala| 40 +- .../apache/spark/ml/feature/HashingTFSuite.scala | 4 +++ 2 files changed, 35 insertions(+), 9 deletions(-) diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala index 80bf859..d2bb013 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala @@ -42,14 +42,17 @@ import org.apache.spark.util.VersionUtils.majorMinorVersion * otherwise the features will not be mapped evenly to the columns. 
*/ @Since("1.2.0") -class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) +class HashingTF @Since("3.0.0") private[ml] ( +@Since("1.4.0") override val uid: String, +@Since("3.1.0") val hashFuncVersion: Int) extends Transformer with HasInputCol with HasOutputCol with HasNumFeatures with DefaultParamsWritable { - private var hashFunc: Any => Int = FeatureHasher.murmur3Hash - @Since("1.2.0") - def this() = this(Identifiable.randomUID("hashingTF")) + def this() = this(Identifiable.randomUID("hashingTF"), HashingTF.SPARK_3_MURMUR3_HASH) + + @Since("1.4.0") + def this(uid: String) = this(uid, hashFuncVersion = HashingTF.SPARK_3_MURMUR3_HASH) /** @group setParam */ @Since("1.4.0") @@ -122,7 +125,12 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) */ @Since("3.0.0") def indexOf(term: Any): Int = { -Utils.nonNegativeMod(hashFunc(term), $(numFeatures)) +val hashValue = hashFuncVersion match { + case HashingTF.SPARK_2_MURMUR3_HASH => OldHashingTF.murmur3Hash(term) + case HashingTF.SPARK_3_MURMUR3_HASH => FeatureHasher.murmur3Hash(term) + case _ => throw new IllegalArgumentException("Illegal hash function version setting.") +} +Utils.nonNegativeMod(hashValue, $(numFeatures)) } @Since("1.4.1") @@ -132,27 +140,41 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def toString: String = { s"HashingTF: uid=$uid, binary=${$(binary)}, numFeatures=${$(numFeatures)}" } + + @Since("3.0.0") + override def save(path: String): Unit = { +require(hashFuncVersion == HashingTF.SPARK_3_MURMUR3_HASH, + "Cannot save model which is loaded from lower version spark saved model. 
We can address " + + "it by (1) use old spark version to save the model, or (2) use new version spark to " + + "re-train the pipeline.") +super.save(path) + } } @Since("1.6.0") object HashingTF extends DefaultParamsReadable[HashingTF] { + private[ml] val SPARK_2_MURMUR3_HASH = 1 + private[ml] val SPARK_3_MURMUR3_HASH = 2 + private class HashingTFReader extends MLReader[HashingTF] { private val className = classOf[HashingTF].getName override def load(path: String): HashingTF = { val metadata = DefaultParamsReader.loadMetadata(path, sc, className) - val hashingTF = new HashingTF(metadata.uid) - metadata.getAndSetParams(hashingTF) // We support loading old `HashingTF` saved by previous Spark versions. // Previous `HashingTF` uses `mllib.feature.HashingTF.murmur3Hash`, but new `HashingTF` uses // `ml.Feature.FeatureHasher.murmur3Hash`. val (majorVersion, _) = majorMinorVersion(metadata.sparkVersion) - if (majorVersion < 3) { -hashingTF.hashFunc = OldHashingTF.murmur3Hash + val hashFuncVersion = if (majorVersion < 3) { +
[spark] branch branch-3.0 updated: [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new cbe75bb [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator cbe75bb is described below commit cbe75bb8879ec408088eaf6944284a893bb63c92 Author: Max Gekk AuthorDate: Tue May 12 14:05:31 2020 + [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator ### What changes were proposed in this pull request? Generates java.time.Instant/java.time.LocalDate for DateType/TimestampType by `RandomDataGenerator.forType` when the SQL config `spark.sql.datetime.java8API.enabled` is set to `true`. ### Why are the changes needed? To improve test coverage, and check java.time.Instant/java.time.LocalDate types in round trip tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running modified test suites `RowEncoderSuite`, `RandomDataGeneratorSuite` and `HadoopFsRelationTest`. Closes #28502 from MaxGekk/random-java8-datetime. 
Authored-by: Max Gekk Signed-off-by: Wenchen Fan (cherry picked from commit a3fafddf390fd180047a0b9ef46f052a9b6813e0) Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/RandomDataGenerator.scala | 105 + .../spark/sql/RandomDataGeneratorSuite.scala | 32 --- .../sql/catalyst/encoders/RowEncoderSuite.scala| 36 +++ .../spark/sql/sources/HadoopFsRelationTest.scala | 75 --- 4 files changed, 146 insertions(+), 102 deletions(-) diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala index cf8d772..6a5bdc4 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala @@ -19,13 +19,15 @@ package org.apache.spark.sql import java.math.MathContext import java.sql.{Date, Timestamp} +import java.time.{Instant, LocalDate, LocalDateTime, ZoneId} import scala.collection.mutable import scala.util.{Random, Try} import org.apache.spark.sql.catalyst.CatalystTypeConverters -import org.apache.spark.sql.catalyst.util.DateTimeConstants.MILLIS_PER_DAY +import org.apache.spark.sql.catalyst.util.DateTimeConstants.{MICROS_PER_MILLIS, MILLIS_PER_DAY} import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.CalendarInterval /** @@ -162,7 +164,7 @@ object RandomDataGenerator { }) case BooleanType => Some(() => rand.nextBoolean()) case DateType => -def uniformDateRand(rand: Random): java.sql.Date = { +def uniformDaysRand(rand: Random): Int = { var milliseconds = rand.nextLong() % 25340232959L // -6213574080L is the number of milliseconds before January 1, 1970, 00:00:00 GMT // for "0001-01-01 00:00:00.00". We need to find a @@ -172,27 +174,37 @@ object RandomDataGenerator { // January 1, 1970, 00:00:00 GMT for "-12-31 23:59:59.99". 
milliseconds = rand.nextLong() % 25340232959L } - val date = DateTimeUtils.toJavaDate((milliseconds / MILLIS_PER_DAY).toInt) - // The generated `date` is based on the hybrid calendar Julian + Gregorian since - // 1582-10-15 but it should be valid in Proleptic Gregorian calendar too which is used - // by Spark SQL since version 3.0 (see SPARK-26651). We try to convert `date` to - // a local date in Proleptic Gregorian calendar to satisfy this requirement. - // Some years are leap years in Julian calendar but not in Proleptic Gregorian calendar. - // As the consequence of that, 29 February of such years might not exist in Proleptic - // Gregorian calendar. When this happens, we shift the date by one day. - Try { date.toLocalDate; date }.getOrElse(new Date(date.getTime + MILLIS_PER_DAY)) + (milliseconds / MILLIS_PER_DAY).toInt +} +val specialDates = Seq( + "0001-01-01", // the fist day of Common Era + "1582-10-15", // the cutover date from Julian to Gregorian calendar + "1970-01-01", // the epoch date + "-12-31" // the last supported date according to SQL standard +) +if (SQLConf.get.getConf(SQLConf.DATETIME_JAVA8API_ENABLED)) { + randomNumeric[LocalDate]( +rand, +(rand: Random) => LocalDate.ofEpochDay(uniformDaysRand(rand)), +specialDates.map(LocalDate.parse)) +} else { + randomNumeric[java.sql
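The generator strategy in the patch can be sketched in a few lines of Java (a sketch under stated assumptions, not Spark's `RandomDataGenerator`: the class name, the exact sampling range, and the choice of `9999-12-31` as the SQL-standard maximum date are illustrative here): draw a uniform epoch-day offset, and always mix in a fixed set of calendar boundary dates that round-trip tests should cover.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch (not Spark's RandomDataGenerator API): uniform random
// dates plus a fixed set of calendar boundary "special dates".
final class RandomDates {
    // First day of the Common Era, the Julian-to-Gregorian cutover date,
    // the epoch date, and the SQL-standard maximum date.
    static final String[] SPECIAL_DATES = {
        "0001-01-01", "1582-10-15", "1970-01-01", "9999-12-31"
    };

    static LocalDate uniformDate(Random rand) {
        long minDay = LocalDate.parse("0001-01-01").toEpochDay();
        long maxDay = LocalDate.parse("9999-12-31").toEpochDay();
        long span = maxDay - minDay + 1;
        // floorMod keeps the offset non-negative even for negative longs.
        long day = minDay + Math.floorMod(rand.nextLong(), span);
        return LocalDate.ofEpochDay(day);
    }

    static List<LocalDate> specialDates() {
        List<LocalDate> dates = new ArrayList<>();
        for (String d : SPECIAL_DATES) {
            dates.add(LocalDate.parse(d));
        }
        return dates;
    }
}
```

Returning `java.time.LocalDate` directly is the point of the patch: when `spark.sql.datetime.java8API.enabled` is on, tests exercise the Java 8 types instead of `java.sql.Date`.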
[spark] branch master updated: [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a3fafdd [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator a3fafdd is described below commit a3fafddf390fd180047a0b9ef46f052a9b6813e0 Author: Max Gekk AuthorDate: Tue May 12 14:05:31 2020 + [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator ### What changes were proposed in this pull request? Generates java.time.Instant/java.time.LocalDate for DateType/TimestampType by `RandomDataGenerator.forType` when the SQL config `spark.sql.datetime.java8API.enabled` is set to `true`. ### Why are the changes needed? To improve test coverage, and check java.time.Instant/java.time.LocalDate types in round trip tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running modified test suites `RowEncoderSuite`, `RandomDataGeneratorSuite` and `HadoopFsRelationTest`. Closes #28502 from MaxGekk/random-java8-datetime. 
Authored-by: Max Gekk Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/RandomDataGenerator.scala | 105 + .../spark/sql/RandomDataGeneratorSuite.scala | 32 --- .../sql/catalyst/encoders/RowEncoderSuite.scala| 36 +++ .../spark/sql/sources/HadoopFsRelationTest.scala | 75 --- 4 files changed, 146 insertions(+), 102 deletions(-) diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala index cf8d772..6a5bdc4 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala @@ -19,13 +19,15 @@ package org.apache.spark.sql import java.math.MathContext import java.sql.{Date, Timestamp} +import java.time.{Instant, LocalDate, LocalDateTime, ZoneId} import scala.collection.mutable import scala.util.{Random, Try} import org.apache.spark.sql.catalyst.CatalystTypeConverters -import org.apache.spark.sql.catalyst.util.DateTimeConstants.MILLIS_PER_DAY +import org.apache.spark.sql.catalyst.util.DateTimeConstants.{MICROS_PER_MILLIS, MILLIS_PER_DAY} import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.CalendarInterval /** @@ -162,7 +164,7 @@ object RandomDataGenerator { }) case BooleanType => Some(() => rand.nextBoolean()) case DateType => -def uniformDateRand(rand: Random): java.sql.Date = { +def uniformDaysRand(rand: Random): Int = { var milliseconds = rand.nextLong() % 25340232959L // -6213574080L is the number of milliseconds before January 1, 1970, 00:00:00 GMT // for "0001-01-01 00:00:00.00". We need to find a @@ -172,27 +174,37 @@ object RandomDataGenerator { // January 1, 1970, 00:00:00 GMT for "-12-31 23:59:59.99". 
milliseconds = rand.nextLong() % 25340232959L } - val date = DateTimeUtils.toJavaDate((milliseconds / MILLIS_PER_DAY).toInt) - // The generated `date` is based on the hybrid calendar Julian + Gregorian since - // 1582-10-15 but it should be valid in Proleptic Gregorian calendar too which is used - // by Spark SQL since version 3.0 (see SPARK-26651). We try to convert `date` to - // a local date in Proleptic Gregorian calendar to satisfy this requirement. - // Some years are leap years in Julian calendar but not in Proleptic Gregorian calendar. - // As the consequence of that, 29 February of such years might not exist in Proleptic - // Gregorian calendar. When this happens, we shift the date by one day. - Try { date.toLocalDate; date }.getOrElse(new Date(date.getTime + MILLIS_PER_DAY)) + (milliseconds / MILLIS_PER_DAY).toInt +} +val specialDates = Seq( + "0001-01-01", // the fist day of Common Era + "1582-10-15", // the cutover date from Julian to Gregorian calendar + "1970-01-01", // the epoch date + "-12-31" // the last supported date according to SQL standard +) +if (SQLConf.get.getConf(SQLConf.DATETIME_JAVA8API_ENABLED)) { + randomNumeric[LocalDate]( +rand, +(rand: Random) => LocalDate.ofEpochDay(uniformDaysRand(rand)), +specialDates.map(LocalDate.parse)) +} else { + randomNumeric[java.sql.Date]( +rand, +(rand: Random) => { + val date = DateTimeUtils.toJavaDate(un
[spark] branch master updated (ce714d8 -> 178ca96)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from ce714d8 [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs add 178ca96 [SPARK-31102][SQL] Spark-sql fails to parse when contains comment No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/catalyst/parser/SqlBase.g4 | 2 +- .../spark/sql/catalyst/parser/PlanParserSuite.scala | 7 ++- .../sql/hive/thriftserver/SparkSQLCLIDriver.scala| 12 ++-- .../spark/sql/hive/thriftserver/CliSuite.scala | 20 +++- 4 files changed, 24 insertions(+), 17 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ce714d8 [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs ce714d8 is described below commit ce714d81894a48e2d06c530674c2190e0483e1b4 Author: Kent Yao AuthorDate: Tue May 12 13:37:13 2020 + [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs ### What changes were proposed in this pull request? When I was finding the root cause for SPARK-31675, I noticed that it was very difficult to see what was actually going on, since the CLI output nothing but ```sql Error in query: java.lang.IllegalArgumentException: Wrong FS: blablah/.hive-staging_blahbla, expected: hdfs://cluster1 ``` It is really hard to find the cause from such a terse error message without a certain amount of experience. In this PR, I propose to print the full stack trace when an AnalysisException with underlying root causes occurs; this can be suppressed via the `-S` option. ### Why are the changes needed? In SPARK-11188, >For analysis exceptions in the sql-shell, we should only print the error message to the screen. The stacktrace will never have useful information since this error is used to signify an error with the query. But nowadays, some `AnalysisException`s do carry useful debugging information, e.g. the `AnalysisException` below may wrap exceptions from the Hive or Hadoop side.
https://github.com/apache/spark/blob/a28ed86a387b286745b30cd4d90b3d558205a5a7/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L97-L112 ```scala at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649) at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:468) at org.apache.hadoop.hive.common.FileUtils.isSubDir(FileUtils.java:626) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2850) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1398) at org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1593) ``` ### Does this PR introduce _any_ user-facing change? Yes, `bin/spark-sql` will print all the stack trace when an AnalysisException which contains root causes occurs, before this fix, only the message will be printed. before ```scala Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs:..., expected: hdfs://hz-cluster10; ``` After ```scala Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs:..., expected: hdfs://hz-cluster10; org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Wrong FS: ..., expected: hdfs://hz-cluster10; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:109) at org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:890) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:312) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:101) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106) at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) at org.apache.spark.sql.Dataset.(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.spark.sq
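The behavioral change can be modeled in a few lines of plain Java (a sketch, not Spark's actual SparkSQLCLIDriver code; the class, method name, and silent-flag handling are assumptions): the message is always printed, and the stack trace is appended only when the exception carries an underlying root cause and the CLI was not started in silent (`-S`) mode.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Sketch of the reporting rule introduced by the patch (illustrative names,
// not Spark's actual CLI code): always show the message, and append the full
// stack trace only when there is an underlying cause and silent mode is off.
final class ErrorReporter {
    static String format(Throwable e, boolean silent) {
        StringBuilder sb = new StringBuilder("Error in query: ").append(e.getMessage());
        if (!silent && e.getCause() != null) {
            StringWriter sw = new StringWriter();
            e.printStackTrace(new PrintWriter(sw, true));
            sb.append(System.lineSeparator()).append(sw);
        }
        return sb.toString();
    }
}
```

This keeps the pre-SPARK-11188 terseness for plain query errors (no cause, or `-S` set) while surfacing the Hive/Hadoop root cause when one exists.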
[spark] branch branch-3.0 updated: [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 2549e38 [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs 2549e38 is described below commit 2549e38690fd461c7d01518f1c5df2452efa66b5 Author: Kent Yao AuthorDate: Tue May 12 13:37:13 2020 + [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs ### What changes were proposed in this pull request? When I was finding the root cause for SPARK-31675, I noticed that it was very difficult to see what was actually going on, since the CLI output nothing but ```sql Error in query: java.lang.IllegalArgumentException: Wrong FS: blablah/.hive-staging_blahbla, expected: hdfs://cluster1 ``` It is really hard to find the cause from such a terse error message without a certain amount of experience. In this PR, I propose to print the full stack trace when an AnalysisException with underlying root causes occurs; this can be suppressed via the `-S` option. ### Why are the changes needed? In SPARK-11188, >For analysis exceptions in the sql-shell, we should only print the error message to the screen. The stacktrace will never have useful information since this error is used to signify an error with the query. But nowadays, some `AnalysisException`s do carry useful debugging information, e.g. the `AnalysisException` below may wrap exceptions from the Hive or Hadoop side.
https://github.com/apache/spark/blob/a28ed86a387b286745b30cd4d90b3d558205a5a7/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L97-L112 ```scala at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649) at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:468) at org.apache.hadoop.hive.common.FileUtils.isSubDir(FileUtils.java:626) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2850) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1398) at org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1593) ``` ### Does this PR introduce _any_ user-facing change? Yes, `bin/spark-sql` will print all the stack trace when an AnalysisException which contains root causes occurs, before this fix, only the message will be printed. before ```scala Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs:..., expected: hdfs://hz-cluster10; ``` After ```scala Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs:..., expected: hdfs://hz-cluster10; org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Wrong FS: ..., expected: hdfs://hz-cluster10; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:109) at org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:890) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:312) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:101) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106) at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) at org.apache.spark.sql.Dataset.(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.
[spark] branch master updated: [MINOR][DOCS] Mention lack of RDD order preservation after deserialization
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 59d9099 [MINOR][DOCS] Mention lack of RDD order preservation after deserialization 59d9099 is described below commit 59d90997a52f78450fefbc96beba1d731b3678a1 Author: Antonin Delpeuch AuthorDate: Tue May 12 08:27:43 2020 -0500 [MINOR][DOCS] Mention lack of RDD order preservation after deserialization ### What changes were proposed in this pull request? This changes the docs to make it clearer that order preservation is not guaranteed when saving an RDD to disk and reading it back ([SPARK-5300](https://issues.apache.org/jira/browse/SPARK-5300)). I added two sentences about this in the RDD Programming Guide. The issue was discussed on the dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html ### Why are the changes needed? Because RDDs are order-aware collections, it is natural to expect that if I use `saveAsTextFile` and then load the resulting file with `sparkContext.textFile`, I obtain an RDD in the same order. This is unfortunately not the case at the moment, and there is no agreed-upon way to fix this in Spark itself (see PR #4204 which attempted to fix this). Users should be aware of this. ### Does this PR introduce _any_ user-facing change? Yes, two new sentences in the documentation. ### How was this patch tested? By checking that the documentation looks good. Closes #28465 from wetneb/SPARK-5300-docs. 
Authored-by: Antonin Delpeuch Signed-off-by: Sean Owen --- docs/rdd-programming-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/rdd-programming-guide.md b/docs/rdd-programming-guide.md index ba99007..70bfefc 100644 --- a/docs/rdd-programming-guide.md +++ b/docs/rdd-programming-guide.md @@ -360,7 +360,7 @@ Some notes on reading files with Spark: * If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system. -* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`. +* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`. When multiple files are read, the order of the partitions depends on the order the files are returned from the filesystem. It may or may not, for example, follow the lexicographic ordering of the files by path. Within a partiti [...] * The `textFile` method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
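The caveat added to the guide can be illustrated without a Spark cluster: `saveAsTextFile` writes one `part-NNNNN` file per partition, and a subsequent `textFile` read visits those files in whatever order the filesystem lists them. Below is a minimal pure-Python sketch of that mechanism (the `save_partitions`/`load_partitions` helpers are hypothetical stand-ins, not Spark APIs); only an explicitly sorted listing guarantees the write order is recovered.

```python
import os
import tempfile

# Stand-in for saveAsTextFile: each partition of an RDD becomes a
# separate part-NNNNN file inside the output directory.
def save_partitions(outdir, partitions):
    for i, part in enumerate(partitions):
        with open(os.path.join(outdir, "part-%05d" % i), "w") as f:
            f.write("\n".join(part) + "\n")

# Stand-in for textFile: the read order depends on the order the
# filesystem returns the files (os.listdir here), which is NOT
# guaranteed to be lexicographic. Pass ordered=True to force one.
def load_partitions(outdir, ordered=False):
    names = os.listdir(outdir)
    if ordered:
        names = sorted(names)  # deterministic, lexicographic order
    lines = []
    for name in names:
        with open(os.path.join(outdir, name)) as f:
            lines.extend(f.read().splitlines())
    return lines

outdir = tempfile.mkdtemp()
original = [["a", "b"], ["c", "d"], ["e", "f"]]
save_partitions(outdir, original)

# Only the explicitly sorted read is guaranteed to match the write order;
# an unsorted read may interleave partitions differently.
ordered_read = load_partitions(outdir, ordered=True)
print(ordered_read)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

In real Spark code the analogous defensive move is to impose an explicit ordering after reading back (e.g. with `sortBy`) rather than relying on file-listing order.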