svn commit: r32006 - in /dev/spark/v2.3.3-rc1-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _site/api/java/org/apache/spark
Author: yamamuro Date: Thu Jan 17 05:16:42 2019 New Revision: 32006 Log: Apache Spark v2.3.3-rc1 docs [This commit notification would consist of 1447 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] felixcheung commented on issue #172: Update documentation.md
felixcheung commented on issue #172: Update documentation.md URL: https://github.com/apache/spark-website/pull/172#issuecomment-455042131 tbh, we don't list every book though. but if you'd like, you will need to run the command in the PR description to re-generate the HTML, otherwise the book will not show up on the web page.
svn commit: r32005 - /dev/spark/v2.3.3-rc1-bin/
Author: yamamuro Date: Thu Jan 17 04:53:00 2019 New Revision: 32005 Log: Removing RC artifacts. Removed: dev/spark/v2.3.3-rc1-bin/
svn commit: r32004 - /dev/spark/v2.3.3-rc1-docs/
Author: yamamuro Date: Thu Jan 17 04:51:37 2019 New Revision: 32004 Log: Removing RC artifacts. Removed: dev/spark/v2.3.3-rc1-docs/
svn commit: r32003 - in /dev/spark/3.0.0-SNAPSHOT-2019_01_16_20_13-4915cb3-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Jan 17 04:25:17 2019 New Revision: 32003 Log: Apache Spark 3.0.0-SNAPSHOT-2019_01_16_20_13-4915cb3 docs [This commit notification would consist of 1778 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
[spark] branch master updated: [MINOR][BUILD] ensure call to translate_component has correct number of arguments
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4915cb3 [MINOR][BUILD] ensure call to translate_component has correct number of arguments 4915cb3 is described below commit 4915cb3adff6f1ec9638b24bfb74f83940ec3d95 Author: wright AuthorDate: Wed Jan 16 21:00:58 2019 -0600 [MINOR][BUILD] ensure call to translate_component has correct number of arguments ## What changes were proposed in this pull request? The call to `translate_component` only supplied 2 out of the 3 required arguments. I added a default empty list for the missing argument to avoid a run-time error. I work for Semmle, and noticed the bug with our LGTM code analyzer: https://lgtm.com/projects/g/apache/spark/snapshot/0655f1624ff7b73e5c8937ab9e83453a5a3a4466/files/dev/create-release/releaseutils.py?sort=name=ASC=heatmap#x1434915b6576fb40:1 ## How was this patch tested? I checked that `./dev/run-tests` pass OK. Closes #23567 from ipwright/wrong-number-of-arguments-fix. Authored-by: wright Signed-off-by: Sean Owen --- dev/create-release/releaseutils.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/create-release/releaseutils.py b/dev/create-release/releaseutils.py index f273b33..a5a26ae 100755 --- a/dev/create-release/releaseutils.py +++ b/dev/create-release/releaseutils.py @@ -236,7 +236,7 @@ def translate_component(component, commit_hash, warnings): # The returned components are already filtered and translated def find_components(commit, commit_hash): components = re.findall(r"\[\w*\]", commit.lower()) -components = [translate_component(c, commit_hash) +components = [translate_component(c, commit_hash, []) for c in components if c in known_components] return components - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
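The fix above supplies the missing third argument at the call site. A minimal Python sketch of the failure mode and the repair — hypothetical function body, standing in for Spark's actual `releaseutils.py` helper:

```python
def translate_component(component, commit_hash, warnings):
    # Requires three arguments; calling it with only two raises a TypeError
    # at run time, which is exactly the bug the static analyzer flagged.
    if component == "[core]":
        return "Core"
    warnings.append("unknown component %s in commit %s" % (component, commit_hash))
    return component

# Broken call, as in the original find_components: missing positional argument.
try:
    translate_component("[core]", "abc123")
except TypeError:
    print("call with 2 args fails")

# Fixed call, passing an empty list for the warnings accumulator.
print(translate_component("[core]", "abc123", []))
```

Because Python checks arity only when the function is actually called, bugs like this survive until the code path runs, which is why a static analyzer caught it before the release script did.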
[spark] branch master updated: [SPARK-26466][CORE] Use ConfigEntry for hardcoded configs for submit categories.
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 38f0307 [SPARK-26466][CORE] Use ConfigEntry for hardcoded configs for submit categories. 38f0307 is described below commit 38f030725c561979ca98b2a6cc7ca6c02a1f80ed Author: Jungtaek Lim (HeartSaVioR) AuthorDate: Wed Jan 16 20:57:21 2019 -0600 [SPARK-26466][CORE] Use ConfigEntry for hardcoded configs for submit categories. ## What changes were proposed in this pull request? The PR makes hardcoded configs below to use `ConfigEntry`. * spark.kryo * spark.kryoserializer * spark.serializer * spark.jars * spark.files * spark.submit * spark.deploy * spark.worker This patch doesn't change configs which are not relevant to SparkConf (e.g. system properties). ## How was this patch tested? Existing tests. Closes #23532 from HeartSaVioR/SPARK-26466-v2. Authored-by: Jungtaek Lim (HeartSaVioR) Signed-off-by: Sean Owen --- .../main/scala/org/apache/spark/SparkConf.scala| 21 +++ .../main/scala/org/apache/spark/SparkContext.scala | 4 +- .../src/main/scala/org/apache/spark/SparkEnv.scala | 9 ++- .../main/scala/org/apache/spark/api/r/RUtils.scala | 3 +- .../apache/spark/deploy/FaultToleranceTest.scala | 6 +- .../org/apache/spark/deploy/SparkCuratorUtil.scala | 3 +- .../org/apache/spark/deploy/SparkSubmit.scala | 29 + .../org/apache/spark/deploy/master/Master.scala| 48 +++ .../spark/deploy/master/RecoveryModeFactory.scala | 7 ++- .../master/ZooKeeperLeaderElectionAgent.scala | 5 +- .../deploy/master/ZooKeeperPersistenceEngine.scala | 15 ++--- .../apache/spark/deploy/worker/DriverRunner.scala | 6 +- .../org/apache/spark/deploy/worker/Worker.scala| 20 +++ .../spark/deploy/worker/WorkerArguments.scala | 5 +- .../spark/deploy/worker/ui/WorkerWebUI.scala | 2 - .../org/apache/spark/internal/config/Deploy.scala | 68 ++ 
.../org/apache/spark/internal/config/Kryo.scala| 57 ++ .../org/apache/spark/internal/config/Worker.scala | 63 .../org/apache/spark/internal/config/package.scala | 31 ++ .../apache/spark/serializer/JavaSerializer.scala | 5 +- .../apache/spark/serializer/KryoSerializer.scala | 29 - .../main/scala/org/apache/spark/util/Utils.scala | 12 ++-- .../org/apache/spark/JobCancellationSuite.scala| 3 +- .../test/scala/org/apache/spark/ShuffleSuite.scala | 3 +- .../scala/org/apache/spark/SparkConfSuite.scala| 31 +- .../spark/api/python/PythonBroadcastSuite.scala| 3 +- .../apache/spark/broadcast/BroadcastSuite.scala| 3 +- .../org/apache/spark/deploy/SparkSubmitSuite.scala | 34 +-- .../apache/spark/deploy/master/MasterSuite.scala | 14 ++--- .../deploy/master/PersistenceEngineSuite.scala | 3 +- .../deploy/rest/SubmitRestProtocolSuite.scala | 3 +- .../apache/spark/deploy/worker/WorkerSuite.scala | 9 +-- .../apache/spark/scheduler/MapStatusSuite.scala| 2 +- .../serializer/GenericAvroSerializerSuite.scala| 3 +- .../apache/spark/serializer/KryoBenchmark.scala| 8 ++- .../spark/serializer/KryoSerializerBenchmark.scala | 8 ++- .../KryoSerializerDistributedSuite.scala | 4 +- .../KryoSerializerResizableOutputSuite.scala | 14 +++-- .../spark/serializer/KryoSerializerSuite.scala | 44 +++--- .../serializer/SerializerPropertiesSuite.scala | 3 +- .../serializer/UnsafeKryoSerializerSuite.scala | 6 +- .../spark/storage/FlatmapIteratorSuite.scala | 4 +- .../scala/org/apache/spark/util/UtilsSuite.scala | 3 +- .../collection/ExternalAppendOnlyMapSuite.scala| 4 +- .../util/collection/ExternalSorterSuite.scala | 7 ++- .../apache/spark/ml/attribute/AttributeSuite.scala | 3 +- .../apache/spark/ml/feature/InstanceSuite.scala| 3 +- .../spark/ml/feature/LabeledPointSuite.scala | 3 +- .../apache/spark/ml/tree/impl/TreePointSuite.scala | 3 +- .../spark/mllib/clustering/KMeansSuite.scala | 3 +- .../apache/spark/mllib/feature/Word2VecSuite.scala | 19 -- .../apache/spark/mllib/linalg/MatricesSuite.scala | 3 +- 
.../apache/spark/mllib/linalg/VectorsSuite.scala | 3 +- .../spark/mllib/regression/LabeledPointSuite.scala | 3 +- .../distribution/MultivariateGaussianSuite.scala | 3 +- .../k8s/features/BasicDriverFeatureStep.scala | 11 ++-- .../k8s/features/DriverCommandFeatureStep.scala| 13 +++-- .../k8s/features/BasicDriverFeatureStepSuite.scala | 6 +- .../integrationtest/KubernetesTestComponents.scala | 3 +- .../deploy/mesos/MesosClusterDispatcher.scala | 1 + .../org/apache/spark/deploy/mesos/config.scala | 12
[spark] branch master updated: [SPARK-26600] Update spark-submit usage message
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 272428d [SPARK-26600] Update spark-submit usage message 272428d is described below commit 272428db6fabf43ceaa7d0ee9297916e5f9b5c24 Author: Luca Canali AuthorDate: Wed Jan 16 20:55:28 2019 -0600 [SPARK-26600] Update spark-submit usage message ## What changes were proposed in this pull request? Spark-submit usage message should be put in sync with recent changes in particular regarding K8S support. These are the proposed changes to the usage message: --executor-cores NUM -> can be useed for Spark on YARN and K8S --principal PRINCIPAL and --keytab KEYTAB -> can be used for Spark on YARN and K8S --total-executor-cores NUM> can be used for Spark standalone, YARN and K8S In addition this PR proposes to remove certain implementation details from the --keytab argument description as the implementation details vary between YARN and K8S, for example. ## How was this patch tested? Manually tested Closes #23518 from LucaCanali/updateSparkSubmitArguments. Authored-by: Luca Canali Signed-off-by: Sean Owen --- .../apache/spark/deploy/SparkSubmitArguments.scala | 23 +++--- 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala index 34facd5..f5e4c4a 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala @@ -576,27 +576,26 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S | --kill SUBMISSION_IDIf given, kills the driver specified. | --status SUBMISSION_ID If given, requests the status of the driver specified. 
| -| Spark standalone and Mesos only: +| Spark standalone, Mesos and Kubernetes only: | --total-executor-cores NUM Total cores for all executors. | -| Spark standalone and YARN only: -| --executor-cores NUMNumber of cores per executor. (Default: 1 in YARN mode, -| or all available cores on the worker in standalone mode) +| Spark standalone, YARN and Kubernetes only: +| --executor-cores NUMNumber of cores used by each executor. (Default: 1 in +| YARN and K8S modes, or all available cores on the worker +| in standalone mode). | -| YARN-only: +| Spark on YARN and Kubernetes only: +| --principal PRINCIPAL Principal to be used to login to KDC. +| --keytab KEYTAB The full path to the file that contains the keytab for the +| principal specified above. +| +| Spark on YARN only: | --queue QUEUE_NAME The YARN queue to submit to (Default: "default"). | --num-executors NUM Number of executors to launch (Default: 2). | If dynamic allocation is enabled, the initial number of | executors will be at least NUM. | --archives ARCHIVES Comma separated list of archives to be extracted into the | working directory of each executor. -| --principal PRINCIPAL Principal to be used to login to KDC, while running on -| secure HDFS. -| --keytab KEYTAB The full path to the file that contains the keytab for the -| principal specified above. This keytab will be copied to -| the node running the Application Master via the Secure -| Distributed Cache, for renewing the login tickets and the -| delegation tokens periodically. """.stripMargin ) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r32000 - in /dev/spark/2.4.1-SNAPSHOT-2019_01_16_17_46-d608325-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Jan 17 02:01:07 2019 New Revision: 32000 Log: Apache Spark 2.4.1-SNAPSHOT-2019_01_16_17_46-d608325 docs [This commit notification would consist of 1476 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
svn commit: r31996 - in /dev/spark/3.0.0-SNAPSHOT-2019_01_16_15_24-dc3b35c-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed Jan 16 23:36:36 2019 New Revision: 31996 Log: Apache Spark 3.0.0-SNAPSHOT-2019_01_16_15_24-dc3b35c docs [This commit notification would consist of 1777 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
[spark] branch branch-2.4 updated: [SPARK-26633][REPL] Add ExecutorClassLoader.getResourceAsStream
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new d608325 [SPARK-26633][REPL] Add ExecutorClassLoader.getResourceAsStream d608325 is described below commit d608325a9fbefbe3a05e2bc8f6a95b2fa9c4174b Author: Kris Mok AuthorDate: Wed Jan 16 15:21:11 2019 -0800 [SPARK-26633][REPL] Add ExecutorClassLoader.getResourceAsStream ## What changes were proposed in this pull request? Add `ExecutorClassLoader.getResourceAsStream`, so that classes dynamically generated by the REPL can be accessed by user code as `InputStream`s for non-class-loading purposes, such as reading the class file for extracting method/constructor parameter names. Caveat: The convention in Java's `ClassLoader` is that `ClassLoader.getResourceAsStream()` should be considered as a convenience method of `ClassLoader.getResource()`, where the latter provides a `URL` for the resource, and the former invokes `openStream()` on it to serve the resource as an `InputStream`. The former should also catch `IOException` from `openStream()` and convert it to `null`. This PR breaks this convention by only overriding `ClassLoader.getResourceAsStream()` instead of also overriding `ClassLoader.getResource()`, so after this PR, it would be possible to get a non-null result from the former, but get a null result from the latter. This isn't ideal, but it's sufficient to cover the main use case and practically it shouldn't matter. To implement the convention properly, we'd need to register a URL protocol handler with Java to allow it to properly handle the `spark://` protocol, etc, which sounds like an overkill for the intent of this PR. Credit goes to zsxwing for the initial investigation and fix suggestion. ## How was this patch tested? Added new test case in `ExecutorClassLoaderSuite` and `ReplSuite`. 
Closes #23558 from rednaxelafx/executorclassloader-getresourceasstream. Authored-by: Kris Mok Signed-off-by: gatorsmile (cherry picked from commit dc3b35c5da42def803dd05e2db7506714018e27b) Signed-off-by: gatorsmile --- .../apache/spark/repl/ExecutorClassLoader.scala| 31 +++-- .../spark/repl/ExecutorClassLoaderSuite.scala | 11 .../scala/org/apache/spark/repl/ReplSuite.scala| 32 ++ 3 files changed, 72 insertions(+), 2 deletions(-) diff --git a/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala b/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala index 88eb0ad..a4a11f0 100644 --- a/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala +++ b/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala @@ -33,8 +33,11 @@ import org.apache.spark.util.ParentClassLoader /** * A ClassLoader that reads classes from a Hadoop FileSystem or Spark RPC endpoint, used to load * classes defined by the interpreter when the REPL is used. Allows the user to specify if user - * class path should be first. This class loader delegates getting/finding resources to parent - * loader, which makes sense until REPL never provide resource dynamically. + * class path should be first. + * This class loader delegates getting/finding resources to parent loader, which makes sense because + * the REPL never produce resources dynamically. One exception is when getting a Class file as + * resource stream, in which case we will try to fetch the Class file in the same way as loading + * the class, so that dynamically generated Classes from the REPL can be picked up. * * Note: [[ClassLoader]] will preferentially load class from parent. Only when parent is null or * the load failed, that it will call the overridden `findClass` function. 
To avoid the potential @@ -71,6 +74,30 @@ class ExecutorClassLoader( parentLoader.getResources(name) } + override def getResourceAsStream(name: String): InputStream = { +if (userClassPathFirst) { + val res = getClassResourceAsStreamLocally(name) + if (res != null) res else parentLoader.getResourceAsStream(name) +} else { + val res = parentLoader.getResourceAsStream(name) + if (res != null) res else getClassResourceAsStreamLocally(name) +} + } + + private def getClassResourceAsStreamLocally(name: String): InputStream = { +// Class files can be dynamically generated from the REPL. Allow this class loader to +// load such files for purposes other than loading the class. +try { + if (name.endsWith(".class")) fetchFn(name) else null +} catch { + // The helper functions referenced by fetchFn throw CNFE to indicate failure to fetch + // the class. It matches what IOException was supposed to be used for, and + //
[spark] branch master updated: [SPARK-26633][REPL] Add ExecutorClassLoader.getResourceAsStream
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new dc3b35c [SPARK-26633][REPL] Add ExecutorClassLoader.getResourceAsStream dc3b35c is described below commit dc3b35c5da42def803dd05e2db7506714018e27b Author: Kris Mok AuthorDate: Wed Jan 16 15:21:11 2019 -0800 [SPARK-26633][REPL] Add ExecutorClassLoader.getResourceAsStream ## What changes were proposed in this pull request? Add `ExecutorClassLoader.getResourceAsStream`, so that classes dynamically generated by the REPL can be accessed by user code as `InputStream`s for non-class-loading purposes, such as reading the class file for extracting method/constructor parameter names. Caveat: The convention in Java's `ClassLoader` is that `ClassLoader.getResourceAsStream()` should be considered as a convenience method of `ClassLoader.getResource()`, where the latter provides a `URL` for the resource, and the former invokes `openStream()` on it to serve the resource as an `InputStream`. The former should also catch `IOException` from `openStream()` and convert it to `null`. This PR breaks this convention by only overriding `ClassLoader.getResourceAsStream()` instead of also overriding `ClassLoader.getResource()`, so after this PR, it would be possible to get a non-null result from the former, but get a null result from the latter. This isn't ideal, but it's sufficient to cover the main use case and practically it shouldn't matter. To implement the convention properly, we'd need to register a URL protocol handler with Java to allow it to properly handle the `spark://` protocol, etc, which sounds like an overkill for the intent of this PR. Credit goes to zsxwing for the initial investigation and fix suggestion. ## How was this patch tested? Added new test case in `ExecutorClassLoaderSuite` and `ReplSuite`. 
Closes #23558 from rednaxelafx/executorclassloader-getresourceasstream. Authored-by: Kris Mok Signed-off-by: gatorsmile --- .../apache/spark/repl/ExecutorClassLoader.scala| 31 +++-- .../spark/repl/ExecutorClassLoaderSuite.scala | 11 .../scala/org/apache/spark/repl/ReplSuite.scala| 32 ++ 3 files changed, 72 insertions(+), 2 deletions(-) diff --git a/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala b/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala index 3176502..177bce2 100644 --- a/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala +++ b/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala @@ -33,8 +33,11 @@ import org.apache.spark.util.ParentClassLoader /** * A ClassLoader that reads classes from a Hadoop FileSystem or Spark RPC endpoint, used to load * classes defined by the interpreter when the REPL is used. Allows the user to specify if user - * class path should be first. This class loader delegates getting/finding resources to parent - * loader, which makes sense until REPL never provide resource dynamically. + * class path should be first. + * This class loader delegates getting/finding resources to parent loader, which makes sense because + * the REPL never produce resources dynamically. One exception is when getting a Class file as + * resource stream, in which case we will try to fetch the Class file in the same way as loading + * the class, so that dynamically generated Classes from the REPL can be picked up. * * Note: [[ClassLoader]] will preferentially load class from parent. Only when parent is null or * the load failed, that it will call the overridden `findClass` function. 
To avoid the potential @@ -71,6 +74,30 @@ class ExecutorClassLoader( parentLoader.getResources(name) } + override def getResourceAsStream(name: String): InputStream = { +if (userClassPathFirst) { + val res = getClassResourceAsStreamLocally(name) + if (res != null) res else parentLoader.getResourceAsStream(name) +} else { + val res = parentLoader.getResourceAsStream(name) + if (res != null) res else getClassResourceAsStreamLocally(name) +} + } + + private def getClassResourceAsStreamLocally(name: String): InputStream = { +// Class files can be dynamically generated from the REPL. Allow this class loader to +// load such files for purposes other than loading the class. +try { + if (name.endsWith(".class")) fetchFn(name) else null +} catch { + // The helper functions referenced by fetchFn throw CNFE to indicate failure to fetch + // the class. It matches what IOException was supposed to be used for, and + // ClassLoader.getResourceAsStream() catches IOException and returns null in that case. + // So we follow that model and handle CNFE
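The lookup order in the new `getResourceAsStream` override — try one source, fall back to the other depending on `userClassPathFirst` — can be mirrored in a short Python sketch. The loader functions here are hypothetical stand-ins for the parent class loader and the REPL-local class fetch:

```python
def get_resource_as_stream(name, user_class_path_first, local_fetch, parent_fetch):
    """Return the first non-None result. The REPL-local fetch is tried before
    or after the parent loader depending on user_class_path_first, mirroring
    the two branches of ExecutorClassLoader.getResourceAsStream."""
    if user_class_path_first:
        first, second = local_fetch, parent_fetch
    else:
        first, second = parent_fetch, local_fetch
    res = first(name)
    return res if res is not None else second(name)

def local(name):
    # Only .class files are served locally, as in getClassResourceAsStreamLocally.
    return b"local-bytes" if name.endswith(".class") else None

def parent(name):
    # Pretend the parent loader only knows about one static resource.
    return b"parent-bytes" if name == "app.conf" else None
```

Either way, a dynamically generated `.class` file is found: when the parent loader returns nothing, the local fetch is consulted, which is what lets REPL-generated classes be read as streams.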
svn commit: r31994 - in /dev/spark/2.4.1-SNAPSHOT-2019_01_16_12_58-1843c16-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed Jan 16 21:14:11 2019 New Revision: 31994 Log: Apache Spark 2.4.1-SNAPSHOT-2019_01_16_12_58-1843c16 docs [This commit notification would consist of 1476 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
svn commit: r31993 - in /dev/spark/2.3.4-SNAPSHOT-2019_01_16_12_57-c0fc6d0-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed Jan 16 21:12:23 2019 New Revision: 31993 Log: Apache Spark 2.3.4-SNAPSHOT-2019_01_16_12_57-c0fc6d0 docs [This commit notification would consist of 1443 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
[spark] branch master updated: [SPARK-26625] Add oauthToken to spark.redaction.regex
This is an automated email from the ASF dual-hosted git repository. mcheah pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 01301d0 [SPARK-26625] Add oauthToken to spark.redaction.regex 01301d0 is described below commit 01301d09721cc12f1cc66ab52de3da117f5d33e6 Author: Vinoo Ganesh AuthorDate: Wed Jan 16 11:43:10 2019 -0800 [SPARK-26625] Add oauthToken to spark.redaction.regex ## What changes were proposed in this pull request? The regex (spark.redaction.regex) that is used to decide which config properties or environment settings are sensitive should also include oauthToken to match spark.kubernetes.authenticate.submission.oauthToken ## How was this patch tested? Simple regex addition - happy to add a test if needed. Author: Vinoo Ganesh Closes #23555 from vinooganesh/vinooganesh/SPARK-26625. --- core/src/main/scala/org/apache/spark/internal/config/package.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala b/core/src/main/scala/org/apache/spark/internal/config/package.scala index 95becfa..0e78637 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/package.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala @@ -512,7 +512,7 @@ package object config { "a property key or value, the value is redacted from the environment UI and various logs " + "like YARN and event logs.") .regexConf - .createWithDefault("(?i)secret|password".r) + .createWithDefault("(?i)secret|password|token".r) private[spark] val STRING_REDACTION_PATTERN = ConfigBuilder("spark.redaction.string.regex") - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
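The one-word change above widens the default pattern from `(?i)secret|password` to `(?i)secret|password|token`, so keys like `spark.kubernetes.authenticate.submission.oauthToken` are caught. A sketch of the effect using plain `re` — this redacts by key only, a simplification of Spark's actual redaction, which also inspects values:

```python
import re

# The new default value of spark.redaction.regex.
REDACTION_PATTERN = re.compile(r"(?i)secret|password|token")

def redact(conf):
    """Replace the value of any config entry whose key matches the pattern."""
    return {k: ("*********(redacted)" if REDACTION_PATTERN.search(k) else v)
            for k, v in conf.items()}

conf = {
    "spark.kubernetes.authenticate.submission.oauthToken": "abc",
    "spark.hadoop.fs.s3a.secret.key": "xyz",
    "spark.app.name": "demo",
}
redacted = redact(conf)
```

Note the `(?i)` flag: `oauthToken` matches `token` only because the search is case-insensitive, which is why the pattern needs no extra alternation for capitalized variants.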
[spark] branch master updated: [SPARK-26619][SQL] Prune the unused serializers from SerializeFromObject
This is an automated email from the ASF dual-hosted git repository. dbtsai pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8f17078 [SPARK-26619][SQL] Prune the unused serializers from SerializeFromObject 8f17078 is described below commit 8f170787d24912c3a94ce1510197820c87df472a Author: Liang-Chi Hsieh AuthorDate: Wed Jan 16 19:16:37 2019 + [SPARK-26619][SQL] Prune the unused serializers from SerializeFromObject ## What changes were proposed in this pull request? `SerializeFromObject` now keeps all serializer expressions for domain object even when only part of output attributes are used by top plan. We should be able to prune unused serializers from `SerializeFromObject` in such case. ## How was this patch tested? Added tests. Closes #23562 from viirya/SPARK-26619. Authored-by: Liang-Chi Hsieh Signed-off-by: DB Tsai --- .../org/apache/spark/sql/catalyst/optimizer/Optimizer.scala | 5 + .../spark/sql/catalyst/optimizer/ColumnPruningSuite.scala | 9 + .../src/test/scala/org/apache/spark/sql/DatasetSuite.scala| 11 +++ 3 files changed, 25 insertions(+) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index d92f7f8..20f1221 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -561,6 +561,11 @@ object ColumnPruning extends Rule[LogicalPlan] { case d @ DeserializeToObject(_, _, child) if !child.outputSet.subsetOf(d.references) => d.copy(child = prunedChild(child, d.references)) +case p @ Project(_, s: SerializeFromObject) if p.references != s.outputSet => + val usedRefs = p.references + val prunedSerializer = s.serializer.filter(usedRefs.contains) + p.copy(child = 
SerializeFromObject(prunedSerializer, s.child)) + // Prunes the unused columns from child of Aggregate/Expand/Generate/ScriptTransformation case a @ Aggregate(_, _, child) if !child.outputSet.subsetOf(a.references) => a.copy(child = prunedChild(child, a.references)) diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala index 0cd6e09..73112e3 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala @@ -400,5 +400,14 @@ class ColumnPruningSuite extends PlanTest { comparePlans(optimized, expected) } + test("SPARK-26619: Prune the unused serializers from SerializeFromObject") { +val testRelation = LocalRelation('_1.int, '_2.int) +val serializerObject = CatalystSerde.serialize[(Int, Int)]( + CatalystSerde.deserialize[(Int, Int)](testRelation)) +val query = serializerObject.select('_1) +val optimized = Optimize.execute(query.analyze) +val expected = serializerObject.copy(serializer = Seq(serializerObject.serializer.head)).analyze +comparePlans(optimized, expected) + } // todo: add more tests for column pruning } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala index c90b158..fb8239e 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala @@ -24,6 +24,7 @@ import org.apache.spark.SparkException import org.apache.spark.sql.catalyst.ScroogeLikeExample import org.apache.spark.sql.catalyst.encoders.{OuterScopes, RowEncoder} import org.apache.spark.sql.catalyst.plans.{LeftAnti, LeftSemi} +import org.apache.spark.sql.catalyst.plans.logical.SerializeFromObject import 
org.apache.spark.sql.catalyst.util.sideBySide import org.apache.spark.sql.execution.{LogicalRDD, RDDScanExec} import org.apache.spark.sql.execution.exchange.{BroadcastExchangeExec, ShuffleExchangeExec} @@ -1667,6 +1668,16 @@ class DatasetSuite extends QueryTest with SharedSQLContext { val exceptDF = inputDF.filter(col("a").isin("0") or col("b") > "c") checkAnswer(inputDF.except(exceptDF), Seq(Row("1", null))) } + + test("SPARK-26619: Prune the unused serializers from SerializeFromObjec") { +val data = Seq(("a", 1), ("b", 2), ("c", 3)) +val ds = data.toDS().map(t => (t._1, t._2 + 1)).select("_1") +val serializer = ds.queryExecution.optimizedPlan.collect { + case s: SerializeFromObject => s +}.head +assert(serializer.serializer.size == 1) +checkAnswer(ds, Seq(Row("a"),
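The new `ColumnPruning` case fires only when the parent `Project` references a strict subset of `SerializeFromObject`'s output (`p.references != s.outputSet`), then filters the serializer list down to the referenced attributes. Stripped of Catalyst's expression machinery, the rule reduces to a guarded list filter — a toy Python model with hypothetical attribute names:

```python
def prune_serializers(project_refs, serialize_output, serializers):
    """Mirror the optimizer case: if the Project uses every output attribute,
    leave the serializers alone; otherwise keep only the referenced ones,
    as in s.serializer.filter(usedRefs.contains)."""
    if set(project_refs) == set(serialize_output):
        return serializers  # nothing to prune
    return [s for s in serializers if s["attr"] in project_refs]

serializers = [
    {"attr": "_1", "expr": "serialize(obj._1)"},
    {"attr": "_2", "expr": "serialize(obj._2)"},
]

# .select("_1") references only _1, so the _2 serializer is dropped.
pruned = prune_serializers({"_1"}, {"_1", "_2"}, serializers)
```

This matches the `DatasetSuite` test in the diff: after `map(...).select("_1")`, the optimized plan's `SerializeFromObject` carries a single serializer expression.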
svn commit: r31992 - in /dev/spark/3.0.0-SNAPSHOT-2019_01_16_10_30-190814e-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed Jan 16 18:42:49 2019 New Revision: 31992 Log: Apache Spark 3.0.0-SNAPSHOT-2019_01_16_10_30-190814e docs [This commit notification would consist of 1777 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
[spark] branch branch-2.3 updated: Revert "[SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream"
This is an automated email from the ASF dual-hosted git repository. zsxwing pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.3 by this push: new c0fc6d0 Revert "[SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream" c0fc6d0 is described below commit c0fc6d0d8dbd890a817176eb1da6e98252c2e0c0 Author: Shixiong Zhu AuthorDate: Wed Jan 16 10:03:21 2019 -0800 Revert "[SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream" This reverts commit 5a50ae37f4c41099c174459603966ee25f21ac75. --- .../execution/streaming/FileStreamSourceLog.scala | 4 +- .../sql/execution/streaming/HDFSMetadataLog.scala | 3 +- .../execution/streaming/HDFSMetadataLogSuite.scala | 6 -- .../sql/streaming/FileStreamSourceSuite.scala | 75 ++ 4 files changed, 8 insertions(+), 80 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala index 7b2ea96..8628471 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala @@ -117,9 +117,7 @@ class FileStreamSourceLog( val batches = (existedBatches ++ retrievedBatches).map(i => i._1 -> i._2.get).toArray.sortBy(_._1) -if (startBatchId <= endBatchId) { - HDFSMetadataLog.verifyBatchIds(batches.map(_._1), startId, endId) -} +HDFSMetadataLog.verifyBatchIds(batches.map(_._1), startId, endId) batches } } diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala index d4cfbb3..00bc215 100644 --- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala @@ -457,8 +457,7 @@ object HDFSMetadataLog { } /** - * Verify if batchIds are continuous and between `startId` and `endId` (both inclusive and - * startId assumed to be <= endId). + * Verify if batchIds are continuous and between `startId` and `endId`. * * @param batchIds the sorted ids to verify. * @param startId the start id. If it's set, batchIds should start with this id. diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala index 57a0343..4677769 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala @@ -275,12 +275,6 @@ class HDFSMetadataLogSuite extends SparkFunSuite with SharedSQLContext { intercept[IllegalStateException](verifyBatchIds(Seq(2, 3, 4), None, Some(5L))) intercept[IllegalStateException](verifyBatchIds(Seq(2, 3, 4), Some(1L), Some(5L))) intercept[IllegalStateException](verifyBatchIds(Seq(1, 2, 4, 5), Some(1L), Some(5L))) - -// Related to SPARK-26629, this capatures the behavior for verifyBatchIds when startId > endId -intercept[IllegalStateException](verifyBatchIds(Seq(), Some(2L), Some(1L))) -intercept[AssertionError](verifyBatchIds(Seq(2), Some(2L), Some(1L))) -intercept[AssertionError](verifyBatchIds(Seq(1), Some(2L), Some(1L))) -intercept[AssertionError](verifyBatchIds(Seq(0), Some(2L), Some(1L))) } } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala index fb0b365..d4bd9c7 100644 --- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala @@ -48,33 +48,21 @@ abstract class FileStreamSourceTest * `FileStreamSource` actually being used in the execution. */ abstract class AddFileData extends AddData { -private val _qualifiedBasePath = PrivateMethod[Path]('qualifiedBasePath) - -private def isSamePath(fileSource: FileStreamSource, srcPath: File): Boolean = { - val path = (fileSource invokePrivate _qualifiedBasePath()).toString.stripPrefix("file:") - path == srcPath.getCanonicalPath -} - override def addData(query: Option[StreamExecution]): (Source, Offset) = { require( query.nonEmpty, "Cannot add data when there is no query for finding the active file stream source")
[spark] branch master updated: [SPARK-26550][SQL] New built-in datasource - noop
This is an automated email from the ASF dual-hosted git repository. hvanhovell pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 190814e [SPARK-26550][SQL] New built-in datasource - noop 190814e is described below commit 190814e82eca3da450683685798d470712560a5d Author: Maxim Gekk AuthorDate: Wed Jan 16 19:01:58 2019 +0100 [SPARK-26550][SQL] New built-in datasource - noop ## What changes were proposed in this pull request? In the PR, I propose a new built-in datasource with the name `noop`, which can be used in: - benchmarking, to avoid the additional overhead of actions and unnecessary type conversions - caching of datasets/dataframes - producing other side effects as a consequence of row materialisation, like uploading data to IO caches. ## How was this patch tested? Added a test to check that datasource rows are materialised. Closes #23471 from MaxGekk/none-datasource. Lead-authored-by: Maxim Gekk Co-authored-by: Maxim Gekk Signed-off-by: Herman van Hovell --- ...org.apache.spark.sql.sources.DataSourceRegister | 1 + .../datasources/noop/NoopDataSource.scala | 66 ++ .../sql/execution/datasources/noop/NoopSuite.scala | 62 3 files changed, 129 insertions(+) diff --git a/sql/core/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister b/sql/core/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister index 1b37905..7cdfddc 100644 --- a/sql/core/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister +++ b/sql/core/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister @@ -1,6 +1,7 @@ org.apache.spark.sql.execution.datasources.csv.CSVFileFormat org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider org.apache.spark.sql.execution.datasources.json.JsonFileFormat +org.apache.spark.sql.execution.datasources.noop.NoopDataSource 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat org.apache.spark.sql.execution.datasources.text.TextFileFormat diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/noop/NoopDataSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/noop/NoopDataSource.scala new file mode 100644 index 000..79e4c62 --- /dev/null +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/noop/NoopDataSource.scala @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.noop + +import org.apache.spark.sql.SaveMode +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.sources.DataSourceRegister +import org.apache.spark.sql.sources.v2._ +import org.apache.spark.sql.sources.v2.writer._ +import org.apache.spark.sql.types.StructType + +/** + * This is no-op datasource. It does not do anything besides consuming its input. + * This can be useful for benchmarking or to cache data without any additional overhead. 
+ */ +class NoopDataSource + extends DataSourceV2 + with TableProvider + with DataSourceRegister { + + override def shortName(): String = "noop" + override def getTable(options: DataSourceOptions): Table = NoopTable +} + +private[noop] object NoopTable extends Table with SupportsBatchWrite { + override def newWriteBuilder(options: DataSourceOptions): WriteBuilder = NoopWriteBuilder + override def name(): String = "noop-table" + override def schema(): StructType = new StructType() +} + +private[noop] object NoopWriteBuilder extends WriteBuilder with SupportsSaveMode { + override def buildForBatch(): BatchWrite = NoopBatchWrite + override def mode(mode: SaveMode): WriteBuilder = this +} + +private[noop] object NoopBatchWrite extends BatchWrite { + override def createBatchWriterFactory(): DataWriterFactory = NoopWriterFactory + override def commit(messages: Array[WriterCommitMessage]): Unit = {} + override
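For context, the contract of this sink can be sketched in a few lines of plain Python (a hypothetical model, not Spark's API): a no-op writer consumes every row, forcing materialization, which is the point for benchmarking, and then discards it. In user code the sink would be selected with something like `df.write.format("noop").mode("overwrite").save()` in Spark versions that ship this source.

```python
# Toy sketch of the no-op writer's contract (not Spark's DataWriter API):
# rows are fully consumed -- so upstream work really executes -- but nothing
# is stored, and commit produces no result.

class NoopDataWriter:
    def __init__(self):
        self.rows_seen = 0  # visible side effect: rows were materialized

    def write(self, row):
        self.rows_seen += 1  # consume and discard

    def commit(self):
        return None  # nothing to commit

writer = NoopDataWriter()
for row in [("a", 1), ("b", 2), ("c", 3)]:
    writer.write(row)
writer.commit()
assert writer.rows_seen == 3
```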
[spark] branch branch-2.3 updated: [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream
This is an automated email from the ASF dual-hosted git repository. zsxwing pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.3 by this push: new 5a50ae3 [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream 5a50ae3 is described below commit 5a50ae37f4c41099c174459603966ee25f21ac75 Author: Tathagata Das AuthorDate: Wed Jan 16 09:42:14 2019 -0800 [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream ## What changes were proposed in this pull request? When a streaming query has multiple file streams, and there is a batch where one of the file streams doesn't have data in that batch, then if the query has to restart from that batch, it will throw the following error. ``` java.lang.IllegalStateException: batch 1 doesn't exist at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$.verifyBatchIds(HDFSMetadataLog.scala:300) at org.apache.spark.sql.execution.streaming.FileStreamSourceLog.get(FileStreamSourceLog.scala:120) at org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:181) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets$2.apply(MicroBatchExecution.scala:294) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets$2.apply(MicroBatchExecution.scala:291) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:291) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:178) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:175) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:169) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:205) ``` The existing `HDFSMetadata.verifyBatchIds` threw an error whenever the `batchIds` list was empty. In the context of `FileStreamSource.getBatch` (where verify is called) and `FileStreamSourceLog` (subclass of `HDFSMetadata`), this is usually okay because, in a streaming query with one file stream, `batchIds` can never be empty: - A batch is planned only when the `FileStreamSourceLog` has seen a new offset (that is, there are new data files). - So `FileStreamSource.getBatch` will be called on X to Y where X will always be < Y. 
This internally calls `HDFSMetadata.verifyBatchIds(X+1, Y)` with Y-X ids. For example, `FileStreamSource.getBatch(4, 5)` will call `verify(batchIds = Seq(5), start = 5, end = 5)`. However, the invariant of X < Y does not hold when there are two file stream sources, as a batch may be planned even when only one of the file streams has data. So one of the file streams may not have data, which can lead to `FileStreamSource.getBatch(X, X)` -> `verify(batchIds = Seq.empty, start = X+1, end = X)` -> failure. Note that `FileStreamSource.getBatch(X, X)` gets called **only when restarting a query in a batch where a file source did not have data**. This is because in normal planning of batches, `MicroBatchExecution` avoids calling `FileStreamSource.getBatch(X, X)` when offset X
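The verification contract described above can be sketched in plain Python (a simplified model of `HDFSMetadataLog.verifyBatchIds`, not the actual Scala implementation):

```python
# Simplified model of verifyBatchIds: the ids must be a contiguous run from
# start_id to end_id. An empty list with start_id <= end_id raises -- which
# is exactly the SPARK-26629 failure mode on restart.

def verify_batch_ids(batch_ids, start_id=None, end_id=None):
    if start_id is not None and end_id is not None and not batch_ids:
        raise ValueError("batch {} doesn't exist".format(start_id))
    if batch_ids:
        if start_id is not None and batch_ids[0] != start_id:
            raise ValueError("batches before {} are missing".format(batch_ids[0]))
        if end_id is not None and batch_ids[-1] != end_id:
            raise ValueError("batches after {} are missing".format(batch_ids[-1]))
        for prev, cur in zip(batch_ids, batch_ids[1:]):
            if cur != prev + 1:
                raise ValueError("batch {} doesn't exist".format(prev + 1))

# Normal case: FileStreamSource.getBatch(4, 5) verifies ids 5..5 -- passes.
verify_batch_ids([5], start_id=5, end_id=5)

# SPARK-26629: a source with no data in the batch restarts via getBatch(X, X),
# i.e. verify([], X+1, X) -- which raises, hence the guard added by the fix.
try:
    verify_batch_ids([], start_id=6, end_id=5)
    raised = False
except ValueError:
    raised = True
assert raised
```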
[spark] branch branch-2.4 updated: [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream
This is an automated email from the ASF dual-hosted git repository. zsxwing pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 1843c16 [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream 1843c16 is described below commit 1843c16fda09a3e9373e8f7b3ff5f73455c50442 Author: Tathagata Das AuthorDate: Wed Jan 16 09:42:14 2019 -0800 [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream [Commit message identical to the branch-2.3 notification above.]
[spark] branch master updated: [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream
This is an automated email from the ASF dual-hosted git repository. zsxwing pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 06d5b17 [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream 06d5b17 is described below commit 06d5b173b687c23aa53e293ed6e12ec746393876 Author: Tathagata Das AuthorDate: Wed Jan 16 09:42:14 2019 -0800 [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream [Commit message identical to the branch-2.3 notification above.]
svn commit: r31991 - in /dev/spark/2.4.1-SNAPSHOT-2019_01_16_08_02-3337477-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed Jan 16 16:18:15 2019 New Revision: 31991 Log: Apache Spark 2.4.1-SNAPSHOT-2019_01_16_08_02-3337477 docs [This commit notification would consist of 1476 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
svn commit: r31990 - in /dev/spark/2.3.4-SNAPSHOT-2019_01_16_08_01-8319ba7-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed Jan 16 16:16:52 2019 New Revision: 31990 Log: Apache Spark 2.3.4-SNAPSHOT-2019_01_16_08_01-8319ba7 docs [This commit notification would consist of 1443 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
[spark] branch branch-2.4 updated: [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 3337477 [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing 3337477 is described below commit 3337477b759433f56d2a43be596196479f2b00de Author: Hyukjin Kwon AuthorDate: Wed Jan 16 23:25:57 2019 +0800 [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing This PR proposes to explicitly document that a SparkContext cannot be shared for multiprocessing, and that multi-processing execution is not guaranteed in PySpark. I have seen cases where users attempt to use multiple processes via the `multiprocessing` module from time to time. For instance, see the example in the JIRA (https://issues.apache.org/jira/browse/SPARK-25992). Py4J itself does not support Python's multiprocessing out of the box (sharing the same JavaGateway, for instance). In general, such a pattern can cause errors with somewhat arbitrary symptoms that are difficult to diagnose. For instance, see the error message in the JIRA: ``` Traceback (most recent call last): File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock self.process_request(request, client_address) File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 348, in process_request self.finish_request(request, client_address) File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 361, in finish_request self.RequestHandlerClass(request, client_address, self) File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 696, in __init__ self.handle() File "/usr/local/hadoop/spark2.3.1/python/pyspark/accumulators.py", line 238, in handle _accumulatorRegistry[aid] += update KeyError: 0 ``` The root cause was that the global `_accumulatorRegistry` is not shared across processes. 
Using threads instead of processes is quite easy in Python; see `threading` vs `multiprocessing` - they can usually be direct replacements for each other. For instance, Python also supports a thread pool (`multiprocessing.pool.ThreadPool`), which can be a direct replacement for the process-based pool (`multiprocessing.Pool`). Manually tested, and manually built the doc. Closes #23564 from HyukjinKwon/SPARK-25992. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit 670bc55f8d357a5cd894e290cc2834e952a7cfe0) Signed-off-by: Hyukjin Kwon --- python/pyspark/context.py | 4 1 file changed, 4 insertions(+) diff --git a/python/pyspark/context.py b/python/pyspark/context.py index 6d99e98..aff3635 100644 --- a/python/pyspark/context.py +++ b/python/pyspark/context.py @@ -63,6 +63,10 @@ class SparkContext(object): Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create L{RDD} and broadcast variables on that cluster. + +.. note:: :class:`SparkContext` instance is not supported to share across multiple +processes out of the box, and PySpark does not guarantee multi-processing execution. +Use threads instead for concurrent processing purpose. """ _gateway = None
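The drop-in replacement mentioned above looks like this in practice (plain stdlib Python, no Spark involved):

```python
# ThreadPool exposes the same API as multiprocessing.Pool but runs workers as
# threads in the current process, so shared state (like PySpark's Py4J gateway
# or the accumulator registry) stays visible to every worker.
from multiprocessing.pool import ThreadPool

def square(x):
    return x * x

with ThreadPool(4) as pool:
    results = pool.map(square, range(5))

assert results == [0, 1, 4, 9, 16]
```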
[spark] branch master updated: [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 670bc55 [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing 670bc55 is described below commit 670bc55f8d357a5cd894e290cc2834e952a7cfe0 Author: Hyukjin Kwon AuthorDate: Wed Jan 16 23:25:57 2019 +0800 [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing [Commit message identical to the branch-2.4 notification above.] ## How was this patch tested? Manually tested, and manually built the doc. Closes #23564 from HyukjinKwon/SPARK-25992. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- python/pyspark/context.py | 4 1 file changed, 4 insertions(+) diff --git a/python/pyspark/context.py b/python/pyspark/context.py index 316fbc8..e792300 100644 --- a/python/pyspark/context.py +++ b/python/pyspark/context.py @@ -66,6 +66,10 @@ class SparkContext(object): .. note:: Only one :class:`SparkContext` should be active per JVM. You must `stop()` the active :class:`SparkContext` before creating a new one. + +.. note:: :class:`SparkContext` instance is not supported to share across multiple +processes out of the box, and PySpark does not guarantee multi-processing execution. 
+Use threads instead for concurrent processing purpose. """ _gateway = None - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
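The thread-based replacement the commit message above describes can be sketched as follows. This is a plain-Python illustration of swapping `multiprocessing.Pool` for `multiprocessing.pool.ThreadPool`; the `square` worker and the inputs are hypothetical, not part of the commit:

```python
from multiprocessing.pool import ThreadPool  # thread-based, same API as multiprocessing.Pool

def square(x):
    # In real PySpark code, a thread-based worker could safely reference the one
    # shared SparkContext, because threads share the driver process's state
    # (including Py4J's gateway and the global accumulator registry), whereas
    # forked processes get their own copies and break with errors like the
    # KeyError shown in the traceback above.
    return x * x

# ThreadPool is a drop-in replacement for the process-based Pool:
# same map/apply interface, but no process boundary is crossed.
with ThreadPool(4) as pool:
    results = pool.map(square, range(5))

print(results)  # → [0, 1, 4, 9, 16]
```

The only change from a process-based version is the import: `multiprocessing.Pool(4)` would spawn separate processes, which is exactly the pattern the new docstring note warns against.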
[spark] branch branch-2.4 updated: [MINOR][PYTHON] Fix SQLContext to SparkSession in Python API main page
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new e52acc2 [MINOR][PYTHON] Fix SQLContext to SparkSession in Python API main page e52acc2 is described below commit e52acc2afcba8662b337b42e44a23ef118deea0f Author: Hyukjin Kwon AuthorDate: Wed Jan 16 23:23:36 2019 +0800 [MINOR][PYTHON] Fix SQLContext to SparkSession in Python API main page ## What changes were proposed in this pull request? This PR proposes to fix the deprecated `SQLContext` to `SparkSession` in the Python API main page. **Before:** ![screen shot 2019-01-16 at 5 30 19 pm](https://user-images.githubusercontent.com/6477701/51239583-bac82f80-19b4-11e9-9129-8dae2c23ec79.png) **After:** ![screen shot 2019-01-16 at 5 29 54 pm](https://user-images.githubusercontent.com/6477701/51239577-b734a880-19b4-11e9-8539-592cb772168d.png) ## How was this patch tested? Manually checked the doc after building it. I also checked by `grep -r "SQLContext"`, and it looks like this is the only instance left. Closes #23565 from HyukjinKwon/minor-doc-change. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit e92088de4d6755f975eb8b44b4d75b81e5a0720e) Signed-off-by: Hyukjin Kwon --- python/docs/index.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/docs/index.rst b/python/docs/index.rst index 421c8de..0e7b623 100644 --- a/python/docs/index.rst +++ b/python/docs/index.rst @@ -37,7 +37,7 @@ Core classes: A Discretized Stream (DStream), the basic abstraction in Spark Streaming. -:class:`pyspark.sql.SQLContext` +:class:`pyspark.sql.SparkSession` Main entry point for DataFrame and SQL functionality. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [MINOR][PYTHON] Fix SQLContext to SparkSession in Python API main page
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e92088d [MINOR][PYTHON] Fix SQLContext to SparkSession in Python API main page e92088d is described below commit e92088de4d6755f975eb8b44b4d75b81e5a0720e Author: Hyukjin Kwon AuthorDate: Wed Jan 16 23:23:36 2019 +0800 [MINOR][PYTHON] Fix SQLContext to SparkSession in Python API main page ## What changes were proposed in this pull request? This PR proposes to fix the deprecated `SQLContext` to `SparkSession` in the Python API main page. **Before:** ![screen shot 2019-01-16 at 5 30 19 pm](https://user-images.githubusercontent.com/6477701/51239583-bac82f80-19b4-11e9-9129-8dae2c23ec79.png) **After:** ![screen shot 2019-01-16 at 5 29 54 pm](https://user-images.githubusercontent.com/6477701/51239577-b734a880-19b4-11e9-8539-592cb772168d.png) ## How was this patch tested? Manually checked the doc after building it. I also checked by `grep -r "SQLContext"`, and it looks like this is the only instance left. Closes #23565 from HyukjinKwon/minor-doc-change. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- python/docs/index.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/docs/index.rst b/python/docs/index.rst index 421c8de..0e7b623 100644 --- a/python/docs/index.rst +++ b/python/docs/index.rst @@ -37,7 +37,7 @@ Core classes: A Discretized Stream (DStream), the basic abstraction in Spark Streaming. -:class:`pyspark.sql.SQLContext` +:class:`pyspark.sql.SparkSession` Main entry point for DataFrame and SQL functionality. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-26615][CORE] Fixing transport server/client resource leaks in the core unittests
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 22ab94f [SPARK-26615][CORE] Fixing transport server/client resource leaks in the core unittests 22ab94f is described below commit 22ab94f97cec22086db287d64a05efc3a177f4c2 Author: “attilapiros” AuthorDate: Wed Jan 16 09:00:21 2019 -0600 [SPARK-26615][CORE] Fixing transport server/client resource leaks in the core unittests ## What changes were proposed in this pull request? Fixing resource leaks where TransportClient/TransportServer instances are not closed properly. In StandaloneSchedulerBackend the null check is added because during the SparkContextSchedulerCreationSuite #"local-cluster" test it turned out that the client is not initialised, as org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend#start isn't called. It threw an NPE and some resources remained open. ## How was this patch tested? By executing the unit tests and using some extra temporary logging for counting created and closed TransportClient/TransportServer instances. Closes #23540 from attilapiros/leaks. 
Authored-by: “attilapiros” Signed-off-by: Sean Owen (cherry picked from commit 819e5ea7c290f842c51ead8b4a6593678aeef6bf) Signed-off-by: Sean Owen --- .../cluster/StandaloneSchedulerBackend.scala | 5 +- .../spark/SparkContextSchedulerCreationSuite.scala | 103 --- .../spark/deploy/client/AppClientSuite.scala | 75 ++- .../apache/spark/deploy/master/MasterSuite.scala | 111 + .../apache/spark/storage/BlockManagerSuite.scala | 138 + 5 files changed, 228 insertions(+), 204 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala index f73a58f..6df821f 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala @@ -223,8 +223,9 @@ private[spark] class StandaloneSchedulerBackend( if (stopping.compareAndSet(false, true)) { try { super.stop() -client.stop() - +if (client != null) { + client.stop() +} val callback = shutdownCallback if (callback != null) { callback(this) diff --git a/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala b/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala index f8938df..811b975 100644 --- a/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala +++ b/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala @@ -23,110 +23,129 @@ import org.apache.spark.internal.Logging import org.apache.spark.scheduler.{SchedulerBackend, TaskScheduler, TaskSchedulerImpl} import org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend import org.apache.spark.scheduler.local.LocalSchedulerBackend +import org.apache.spark.util.Utils class SparkContextSchedulerCreationSuite extends SparkFunSuite with LocalSparkContext with PrivateMethodTester with Logging { - def createTaskScheduler(master: 
String): TaskSchedulerImpl = -createTaskScheduler(master, "client") + def noOp(taskSchedulerImpl: TaskSchedulerImpl): Unit = {} - def createTaskScheduler(master: String, deployMode: String): TaskSchedulerImpl = -createTaskScheduler(master, deployMode, new SparkConf()) + def createTaskScheduler(master: String)(body: TaskSchedulerImpl => Unit = noOp): Unit = +createTaskScheduler(master, "client")(body) + + def createTaskScheduler(master: String, deployMode: String)( + body: TaskSchedulerImpl => Unit): Unit = +createTaskScheduler(master, deployMode, new SparkConf())(body) def createTaskScheduler( master: String, deployMode: String, - conf: SparkConf): TaskSchedulerImpl = { + conf: SparkConf)(body: TaskSchedulerImpl => Unit): Unit = { // Create local SparkContext to setup a SparkEnv. We don't actually want to start() the // real schedulers, so we don't want to create a full SparkContext with the desired scheduler. sc = new SparkContext("local", "test", conf) val createTaskSchedulerMethod = PrivateMethod[Tuple2[SchedulerBackend, TaskScheduler]]('createTaskScheduler) -val (_, sched) = SparkContext invokePrivate createTaskSchedulerMethod(sc, master, deployMode) -sched.asInstanceOf[TaskSchedulerImpl] +val (_, sched) = + SparkContext invokePrivate createTaskSchedulerMethod(sc, master, deployMode) +try { +
[spark] branch master updated: [SPARK-26615][CORE] Fixing transport server/client resource leaks in the core unittests
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 819e5ea [SPARK-26615][CORE] Fixing transport server/client resource leaks in the core unittests 819e5ea is described below commit 819e5ea7c290f842c51ead8b4a6593678aeef6bf Author: “attilapiros” AuthorDate: Wed Jan 16 09:00:21 2019 -0600 [SPARK-26615][CORE] Fixing transport server/client resource leaks in the core unittests ## What changes were proposed in this pull request? Fixing resource leaks where TransportClient/TransportServer instances are not closed properly. In StandaloneSchedulerBackend the null check is added because during the SparkContextSchedulerCreationSuite #"local-cluster" test it turned out that the client is not initialised, as org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend#start isn't called. It threw an NPE and some resources remained open. ## How was this patch tested? By executing the unit tests and using some extra temporary logging for counting created and closed TransportClient/TransportServer instances. Closes #23540 from attilapiros/leaks. 
Authored-by: “attilapiros” Signed-off-by: Sean Owen --- .../cluster/StandaloneSchedulerBackend.scala | 5 +- .../spark/SparkContextSchedulerCreationSuite.scala | 103 --- .../spark/deploy/client/AppClientSuite.scala | 75 ++- .../apache/spark/deploy/master/MasterSuite.scala | 111 + .../apache/spark/storage/BlockManagerSuite.scala | 138 + 5 files changed, 228 insertions(+), 204 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala index 66080b6..e0605fe 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala @@ -224,8 +224,9 @@ private[spark] class StandaloneSchedulerBackend( if (stopping.compareAndSet(false, true)) { try { super.stop() -client.stop() - +if (client != null) { + client.stop() +} val callback = shutdownCallback if (callback != null) { callback(this) diff --git a/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala b/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala index f8938df..811b975 100644 --- a/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala +++ b/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala @@ -23,110 +23,129 @@ import org.apache.spark.internal.Logging import org.apache.spark.scheduler.{SchedulerBackend, TaskScheduler, TaskSchedulerImpl} import org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend import org.apache.spark.scheduler.local.LocalSchedulerBackend +import org.apache.spark.util.Utils class SparkContextSchedulerCreationSuite extends SparkFunSuite with LocalSparkContext with PrivateMethodTester with Logging { - def createTaskScheduler(master: String): TaskSchedulerImpl = -createTaskScheduler(master, "client") + def 
noOp(taskSchedulerImpl: TaskSchedulerImpl): Unit = {} - def createTaskScheduler(master: String, deployMode: String): TaskSchedulerImpl = -createTaskScheduler(master, deployMode, new SparkConf()) + def createTaskScheduler(master: String)(body: TaskSchedulerImpl => Unit = noOp): Unit = +createTaskScheduler(master, "client")(body) + + def createTaskScheduler(master: String, deployMode: String)( + body: TaskSchedulerImpl => Unit): Unit = +createTaskScheduler(master, deployMode, new SparkConf())(body) def createTaskScheduler( master: String, deployMode: String, - conf: SparkConf): TaskSchedulerImpl = { + conf: SparkConf)(body: TaskSchedulerImpl => Unit): Unit = { // Create local SparkContext to setup a SparkEnv. We don't actually want to start() the // real schedulers, so we don't want to create a full SparkContext with the desired scheduler. sc = new SparkContext("local", "test", conf) val createTaskSchedulerMethod = PrivateMethod[Tuple2[SchedulerBackend, TaskScheduler]]('createTaskScheduler) -val (_, sched) = SparkContext invokePrivate createTaskSchedulerMethod(sc, master, deployMode) -sched.asInstanceOf[TaskSchedulerImpl] +val (_, sched) = + SparkContext invokePrivate createTaskSchedulerMethod(sc, master, deployMode) +try { + body(sched.asInstanceOf[TaskSchedulerImpl]) +} finally { + Utils.tryLogNonFatalError { +sched.stop() + } +} }
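The refactoring in the diff above applies the loan pattern: instead of returning a live scheduler to the caller (who may forget to stop it), `createTaskScheduler` now takes a `body` callback and guarantees cleanup in a `finally` block. A minimal sketch of the same pattern in Python follows; `with_scheduler` and `FakeScheduler` are illustrative names, not Spark APIs:

```python
def with_scheduler(create, body):
    """Loan pattern: create the resource, lend it to `body`, always clean up."""
    sched = create()
    try:
        body(sched)
    finally:
        # Cleanup runs even when body raises, so a test cannot leak the
        # resource the way the old return-the-resource API could.
        sched.stop()

class FakeScheduler:
    """Stand-in for a scheduler that must be stopped to release resources."""
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True

s = FakeScheduler()
with_scheduler(lambda: s, lambda sched: None)
print(s.stopped)  # → True
```

The design choice mirrors the Scala change: the resource's lifetime is owned by the helper, and callers express only what they want to do with it while it is alive.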
[spark] branch branch-2.3 updated (18c138b -> 8319ba7)
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a change to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git. from 18c138b Revert "[SPARK-26576][SQL] Broadcast hint not applied to partitioned table" add b5ea933 Preparing Spark release v2.3.3-rc1 new 8319ba7 Preparing development version 2.3.4-SNAPSHOT The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: R/pkg/DESCRIPTION | 2 +- assembly/pom.xml | 2 +- common/kvstore/pom.xml| 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 4 ++-- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- hadoop-cloud/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/kubernetes/core/pom.xml | 2 +- resource-managers/mesos/pom.xml | 2 +- resource-managers/yarn/pom.xml| 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 41 files changed, 42 insertions(+), 42 deletions(-) - To unsubscribe, e-mail: 
commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] 01/01: Preparing development version 2.3.4-SNAPSHOT
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git commit 8319ba736c8909aa944e28d1c8de501926e9f50f Author: Takeshi Yamamuro AuthorDate: Wed Jan 16 13:22:01 2019 + Preparing development version 2.3.4-SNAPSHOT --- R/pkg/DESCRIPTION | 2 +- assembly/pom.xml | 2 +- common/kvstore/pom.xml| 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 4 ++-- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- hadoop-cloud/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/kubernetes/core/pom.xml | 2 +- resource-managers/mesos/pom.xml | 2 +- resource-managers/yarn/pom.xml| 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 41 files changed, 42 insertions(+), 42 deletions(-) diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION index 6ec4966..a82446e 100644 --- a/R/pkg/DESCRIPTION +++ b/R/pkg/DESCRIPTION @@ -1,6 +1,6 @@ Package: SparkR Type: Package -Version: 2.3.3 +Version: 2.3.4 Title: R Frontend for Apache Spark Description: Provides an R Frontend for Apache Spark. 
Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"), diff --git a/assembly/pom.xml b/assembly/pom.xml index 6a8cd4f..612a1b8 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +2.3.4-SNAPSHOT ../pom.xml diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index 6010b6e..5547e97 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +2.3.4-SNAPSHOT ../../pom.xml diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index 8b5d3c8..119dde2 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +2.3.4-SNAPSHOT ../../pom.xml diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index dd27a24..dba5224 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +2.3.4-SNAPSHOT ../../pom.xml diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index aded5e7d..56902a3 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +2.3.4-SNAPSHOT ../../pom.xml diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index a50f612..5302d95 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +2.3.4-SNAPSHOT ../../pom.xml diff --git a/common/tags/pom.xml b/common/tags/pom.xml index 8112ca4..232ebfa 100644 --- a/common/tags/pom.xml +++ b/common/tags/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +2.3.4-SNAPSHOT ../../pom.xml diff --git a/common/unsafe/pom.xml b/common/unsafe/pom.xml index 0d5f61f..f0baa2a 100644 --- a/common/unsafe/pom.xml +++ b/common/unsafe/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +
[spark] tag v2.3.3-rc1 created (now b5ea933)
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a change to tag v2.3.3-rc1 in repository https://gitbox.apache.org/repos/asf/spark.git. at b5ea933 (commit) This tag includes the following new commits: new b5ea933 Preparing Spark release v2.3.3-rc1 The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] 01/01: Preparing Spark release v2.3.3-rc1
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a commit to tag v2.3.3-rc1 in repository https://gitbox.apache.org/repos/asf/spark.git commit b5ea9330e3072e99841270b10dc1d2248127064b Author: Takeshi Yamamuro AuthorDate: Wed Jan 16 13:21:25 2019 + Preparing Spark release v2.3.3-rc1 --- assembly/pom.xml | 2 +- common/kvstore/pom.xml| 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 2 +- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- hadoop-cloud/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/kubernetes/core/pom.xml | 2 +- resource-managers/mesos/pom.xml | 2 +- resource-managers/yarn/pom.xml| 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 40 files changed, 40 insertions(+), 40 deletions(-) diff --git a/assembly/pom.xml b/assembly/pom.xml index f8b15cc..6a8cd4f 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../pom.xml diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index e412a47..6010b6e 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ 
-22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../../pom.xml diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index d8f9a3d..8b5d3c8 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../../pom.xml diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index a1a4f87..dd27a24 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../../pom.xml diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index e650978..aded5e7d 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../../pom.xml diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index 350e3cb..a50f612 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../../pom.xml diff --git a/common/tags/pom.xml b/common/tags/pom.xml index e7fea41..8112ca4 100644 --- a/common/tags/pom.xml +++ b/common/tags/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../../pom.xml diff --git a/common/unsafe/pom.xml b/common/unsafe/pom.xml index 601cc5d..0d5f61f 100644 --- a/common/unsafe/pom.xml +++ b/common/unsafe/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../../pom.xml diff --git a/core/pom.xml b/core/pom.xml index 2a7e644..930128d 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../pom.xml diff --git a/docs/_config.yml b/docs/_config.yml index 7629f5f..8e9c3b5 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -14,7 +14,7 @@ include: # These allow
[spark] tag v2.3.3-rc1 deleted (was 2e01a70)
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a change to tag v2.3.3-rc1 in repository https://gitbox.apache.org/repos/asf/spark.git. *** WARNING: tag v2.3.3-rc1 was deleted! *** was 2e01a70 Preparing Spark release v2.3.3-rc1 The revisions that were on this tag are still contained in other references; therefore, this change does not discard any commits from the repository. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.3 updated (1979712 -> 18c138b)
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a change to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git. omit 1979712 [SPARK-26120][TESTS][SS][SPARKR] Fix a streaming query leak in Structured Streaming R tests omit 3137dca Preparing development version 2.3.4-SNAPSHOT omit 2e01a70 Preparing Spark release v2.3.3-rc1 new 2a82295 [SPARK-26120][TESTS][SS][SPARKR] Fix a streaming query leak in Structured Streaming R tests new 18c138b Revert "[SPARK-26576][SQL] Broadcast hint not applied to partitioned table" This update added new revisions after undoing existing revisions. That is to say, some revisions that were in the old version of the branch are not in the new version. This situation occurs when a user --force pushes a change and generates a repository containing something like this: * -- * -- B -- O -- O -- O (1979712) \ N -- N -- N refs/heads/branch-2.3 (18c138b) You should already have received notification emails for all of the O revisions, and so the following emails describe only the N revisions from the common base, B. Any revisions marked "omit" are not gone; other references still refer to them. Any revisions marked "discard" are gone forever. The 2 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. 
Summary of changes: R/pkg/DESCRIPTION | 2 +- assembly/pom.xml | 2 +- common/kvstore/pom.xml| 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 4 ++-- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- hadoop-cloud/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/kubernetes/core/pom.xml | 2 +- resource-managers/mesos/pom.xml | 2 +- resource-managers/yarn/pom.xml| 2 +- sql/catalyst/pom.xml | 2 +- .../apache/spark/sql/catalyst/planning/patterns.scala | 3 +++ sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- .../execution/PruneFileSourcePartitionsSuite.scala| 19 +-- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 43 files changed, 46 insertions(+), 60 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] 01/02: [SPARK-26120][TESTS][SS][SPARKR] Fix a streaming query leak in Structured Streaming R tests
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git commit 2a82295bd4b904f50d42a7585a72f91fff75353d Author: Shixiong Zhu AuthorDate: Wed Nov 21 09:31:12 2018 +0800 [SPARK-26120][TESTS][SS][SPARKR] Fix a streaming query leak in Structured Streaming R tests ## What changes were proposed in this pull request? Stop the streaming query in `Specify a schema by using a DDL-formatted string when reading` to avoid outputting annoying logs. ## How was this patch tested? Jenkins Closes #23089 from zsxwing/SPARK-26120. Authored-by: Shixiong Zhu Signed-off-by: hyukjinkwon (cherry picked from commit 4b7f7ef5007c2c8a5090f22c6e08927e9f9a407b) Signed-off-by: Felix Cheung --- R/pkg/tests/fulltests/test_streaming.R | 1 + 1 file changed, 1 insertion(+) diff --git a/R/pkg/tests/fulltests/test_streaming.R b/R/pkg/tests/fulltests/test_streaming.R index bfb1a04..6f0d2ae 100644 --- a/R/pkg/tests/fulltests/test_streaming.R +++ b/R/pkg/tests/fulltests/test_streaming.R @@ -127,6 +127,7 @@ test_that("Specify a schema by using a DDL-formatted string when reading", { expect_false(awaitTermination(q, 5 * 1000)) callJMethod(q@ssq, "processAllAvailable") expect_equal(head(sql("SELECT count(*) FROM people3"))[[1]], 3) + stopQuery(q) expect_error(read.stream(path = parquetPath, schema = "name stri"), "DataType stri is not supported.") - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] 02/02: Revert "[SPARK-26576][SQL] Broadcast hint not applied to partitioned table"
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git commit 18c138bf01cd43edc6db7115b90c4e7ae7126392 Author: Takeshi Yamamuro AuthorDate: Wed Jan 16 21:56:39 2019 +0900 Revert "[SPARK-26576][SQL] Broadcast hint not applied to partitioned table" This reverts commit 87c2c11e742a8b35699f68ec2002f817c56bef87. --- .../apache/spark/sql/catalyst/planning/patterns.scala | 3 +++ .../execution/PruneFileSourcePartitionsSuite.scala| 19 +-- 2 files changed, 4 insertions(+), 18 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala index a91063b..cc391aa 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala @@ -65,6 +65,9 @@ object PhysicalOperation extends PredicateHelper { val substitutedCondition = substitute(aliases)(condition) (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases) + case h: ResolvedHint => +collectProjectsAndFilters(h.child) + case other => (None, Nil, other, Map.empty) } diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneFileSourcePartitionsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneFileSourcePartitionsSuite.scala index 8a9adf7..9438418 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneFileSourcePartitionsSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneFileSourcePartitionsSuite.scala @@ -17,20 +17,15 @@ package org.apache.spark.sql.hive.execution -import org.scalatest.Matchers._ - import org.apache.spark.sql.QueryTest import org.apache.spark.sql.catalyst.TableIdentifier import 
org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.dsl.plans._ -import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project, ResolvedHint} +import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project} import org.apache.spark.sql.catalyst.rules.RuleExecutor import org.apache.spark.sql.execution.datasources.{CatalogFileIndex, HadoopFsRelation, LogicalRelation, PruneFileSourcePartitions} import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat -import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec -import org.apache.spark.sql.functions.broadcast import org.apache.spark.sql.hive.test.TestHiveSingleton -import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.test.SQLTestUtils import org.apache.spark.sql.types.StructType @@ -96,16 +91,4 @@ class PruneFileSourcePartitionsSuite extends QueryTest with SQLTestUtils with Te assert(size2 < tableStats.get.sizeInBytes) } } - - test("SPARK-26576 Broadcast hint not applied to partitioned table") { -withTable("tbl") { - withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { -spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").saveAsTable("tbl") -val df = spark.table("tbl") -val qe = df.join(broadcast(df), "p").queryExecution -qe.optimizedPlan.collect { case _: ResolvedHint => } should have size 1 -qe.sparkPlan.collect { case j: BroadcastHashJoinExec => j } should have size 1 - } -} - } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] packtpartner commented on issue #172: Update documentation.md
packtpartner commented on issue #172: Update documentation.md
URL: https://github.com/apache/spark-website/pull/172#issuecomment-454746306

   We have added a new book listing like the previous ones.
[spark] Diff for: [GitHub] wangshuo128 closed pull request #23355: [SPARK-26418][SHUFFLE] Only OpenBlocks without any ChunkFetch for one stream will cause memory leak in ExternalShuffleService
diff --git a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
index 20d840baeaf6c..8d46671f9581a 100644
--- a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
+++ b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
@@ -124,6 +124,9 @@ public void setClientId(String id) {
    * to be returned in the same order that they were requested, assuming only a single
    * TransportClient is used to fetch the chunks.
    *
+   * OpenBlocks and following FetchChunk requests for a stream should be sent by the same
+   * TransportClient to avoid potential memory leak on server side.
+   *
    * @param streamId Identifier that refers to a stream in the remote StreamManager. This should
    *                 be agreed upon by client and server beforehand.
    * @param chunkIndex 0-based index of the chunk to fetch
diff --git a/common/network-common/src/main/java/org/apache/spark/network/server/ChunkFetchRequestHandler.java b/common/network-common/src/main/java/org/apache/spark/network/server/ChunkFetchRequestHandler.java
index f08d8b0f984cf..43c3d23b6304d 100644
--- a/common/network-common/src/main/java/org/apache/spark/network/server/ChunkFetchRequestHandler.java
+++ b/common/network-common/src/main/java/org/apache/spark/network/server/ChunkFetchRequestHandler.java
@@ -90,7 +90,6 @@ protected void channelRead0(
     ManagedBuffer buf;
     try {
       streamManager.checkAuthorization(client, msg.streamChunkId.streamId);
-      streamManager.registerChannel(channel, msg.streamChunkId.streamId);
       buf = streamManager.getChunk(msg.streamChunkId.streamId, msg.streamChunkId.chunkIndex);
     } catch (Exception e) {
       logger.error(String.format("Error opening block %s for request from %s",
diff --git a/common/network-common/src/main/java/org/apache/spark/network/server/OneForOneStreamManager.java b/common/network-common/src/main/java/org/apache/spark/network/server/OneForOneStreamManager.java
index 0f6a8824d95e5..3e1f77cdc20ed 100644
--- a/common/network-common/src/main/java/org/apache/spark/network/server/OneForOneStreamManager.java
+++ b/common/network-common/src/main/java/org/apache/spark/network/server/OneForOneStreamManager.java
@@ -23,6 +23,7 @@ import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.atomic.AtomicLong;
 
+import com.google.common.annotations.VisibleForTesting;
 import com.google.common.base.Preconditions;
 import io.netty.channel.Channel;
 import org.apache.commons.lang3.tuple.ImmutablePair;
@@ -202,4 +203,8 @@ public long registerStream(String appId, Iterator buffers) {
     return myStreamId;
   }
 
+  @VisibleForTesting
+  public long getStreamCount() {
+    return streams.size();
+  }
 }
diff --git a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
index 098fa7974b87b..c59433c44c5c0 100644
--- a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
+++ b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
@@ -92,6 +92,7 @@ protected void handleMessage(
       checkAuth(client, msg.appId);
       long streamId = streamManager.registerStream(client.getClientId(),
         new ManagedBufferIterator(msg.appId, msg.execId, msg.blockIds));
+      streamManager.registerChannel(client.getChannel(), streamId);
       if (logger.isTraceEnabled()) {
         logger.trace("Registered streamId {} with {} buffers for client {} from host {}",
           streamId,
diff --git a/common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/StreamStatesCleanupSuite.java b/common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/StreamStatesCleanupSuite.java
new file mode 100644
index 0..4a721260e3db3
--- /dev/null
+++ b/common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/StreamStatesCleanupSuite.java
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the
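For context on the leak this PR addresses: a stream's buffers in `OneForOneStreamManager` are only released when the channel the stream was registered against closes. Before the change, `registerChannel` was called from `ChunkFetchRequestHandler`, so an `OpenBlocks` that was never followed by a `ChunkFetch` left its stream state registered forever; the patch registers the channel immediately after `registerStream` in `ExternalShuffleBlockHandler`. The sketch below is a minimal, self-contained model of that lifecycle with hypothetical names (plain maps and an `int` channel id), not the real Spark/Netty API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the leak fixed by SPARK-26418 -- hypothetical names,
// not the real OneForOneStreamManager. Stream state is only freed when the
// channel it was tied to terminates; a stream never tied to a channel
// (pre-fix behavior when no ChunkFetch follows OpenBlocks) leaks.
public class StreamCleanupSketch {
  static class StreamManager {
    final Map<Long, String> streams = new HashMap<>();        // streamId -> buffers (stubbed)
    final Map<Long, Integer> streamChannel = new HashMap<>(); // streamId -> channel id
    final AtomicLong nextId = new AtomicLong();

    // Called when an OpenBlocks message arrives.
    long registerStream(String buffers) {
      long id = nextId.getAndIncrement();
      streams.put(id, buffers);
      return id;
    }

    // Ties a stream to a channel so channel close can clean it up.
    void registerChannel(int channelId, long streamId) {
      streamChannel.put(streamId, channelId);
    }

    // Channel closed: release every stream registered against it.
    void connectionTerminated(int channelId) {
      streamChannel.entrySet().removeIf(e -> {
        if (e.getValue() == channelId) {
          streams.remove(e.getKey());
          return true;
        }
        return false;
      });
    }
  }

  public static void main(String[] args) {
    StreamManager m = new StreamManager();

    // Pre-fix: stream registered on OpenBlocks, channel only on first ChunkFetch.
    long leaked = m.registerStream("blocks-A");        // client never fetches a chunk
    m.connectionTerminated(1);                         // nothing tied to channel 1
    System.out.println(m.streams.containsKey(leaked)); // true: buffers still held

    // Post-fix: registerChannel is called right after registerStream.
    long ok = m.registerStream("blocks-B");
    m.registerChannel(1, ok);
    m.connectionTerminated(1);
    System.out.println(m.streams.containsKey(ok));     // false: cleaned up
  }
}
```

This also explains the Javadoc addition to `TransportClient`: since cleanup is keyed by the channel that issued `OpenBlocks`, fetching the stream's chunks through a different `TransportClient` would leave server-side state attached to the wrong channel.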