svn commit: r32006 - in /dev/spark/v2.3.3-rc1-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _site/api/java/org/apache/spark
Author: yamamuro Date: Thu Jan 17 05:16:42 2019 New Revision: 32006 Log: Apache Spark v2.3.3-rc1 docs [This commit notification would consist of 1447 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] felixcheung commented on issue #172: Update documentation.md
felixcheung commented on issue #172: Update documentation.md URL: https://github.com/apache/spark-website/pull/172#issuecomment-455042131 tbh, we don't list every book though. but if you'd like, you will need to run the command in the PR description to re-generate the HTML, otherwise the book will not show up on the web page.
svn commit: r32005 - /dev/spark/v2.3.3-rc1-bin/
Author: yamamuro Date: Thu Jan 17 04:53:00 2019 New Revision: 32005 Log: Removing RC artifacts. Removed: dev/spark/v2.3.3-rc1-bin/
svn commit: r32004 - /dev/spark/v2.3.3-rc1-docs/
Author: yamamuro Date: Thu Jan 17 04:51:37 2019 New Revision: 32004 Log: Removing RC artifacts. Removed: dev/spark/v2.3.3-rc1-docs/
svn commit: r32003 - in /dev/spark/3.0.0-SNAPSHOT-2019_01_16_20_13-4915cb3-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Jan 17 04:25:17 2019 New Revision: 32003 Log: Apache Spark 3.0.0-SNAPSHOT-2019_01_16_20_13-4915cb3 docs [This commit notification would consist of 1778 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
[spark] branch master updated: [MINOR][BUILD] ensure call to translate_component has correct number of arguments
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4915cb3 [MINOR][BUILD] ensure call to translate_component has correct number of arguments 4915cb3 is described below commit 4915cb3adff6f1ec9638b24bfb74f83940ec3d95 Author: wright AuthorDate: Wed Jan 16 21:00:58 2019 -0600 [MINOR][BUILD] ensure call to translate_component has correct number of arguments ## What changes were proposed in this pull request? The call to `translate_component` only supplied 2 out of the 3 required arguments. I added a default empty list for the missing argument to avoid a run-time error. I work for Semmle, and noticed the bug with our LGTM code analyzer: https://lgtm.com/projects/g/apache/spark/snapshot/0655f1624ff7b73e5c8937ab9e83453a5a3a4466/files/dev/create-release/releaseutils.py?sort=name=ASC=heatmap#x1434915b6576fb40:1 ## How was this patch tested? I checked that `./dev/run-tests` pass OK. Closes #23567 from ipwright/wrong-number-of-arguments-fix. Authored-by: wright Signed-off-by: Sean Owen --- dev/create-release/releaseutils.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/create-release/releaseutils.py b/dev/create-release/releaseutils.py index f273b33..a5a26ae 100755 --- a/dev/create-release/releaseutils.py +++ b/dev/create-release/releaseutils.py @@ -236,7 +236,7 @@ def translate_component(component, commit_hash, warnings): # The returned components are already filtered and translated def find_components(commit, commit_hash): components = re.findall(r"\[\w*\]", commit.lower()) -components = [translate_component(c, commit_hash) +components = [translate_component(c, commit_hash, []) for c in components if c in known_components] return components - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
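The fix above supplies the missing third argument at the call site. A minimal Python sketch of the failure mode and the repair — hypothetical function body, standing in for Spark's actual `releaseutils.py` helper:

```python
def translate_component(component, commit_hash, warnings):
    # Requires three arguments; calling it with only two raises a TypeError
    # at run time, which is exactly the bug the static analyzer flagged.
    if component == "[core]":
        return "Core"
    warnings.append("unknown component %s in commit %s" % (component, commit_hash))
    return component

# Broken call, as in the original find_components: missing positional argument.
try:
    translate_component("[core]", "abc123")
except TypeError:
    print("call with 2 args fails")

# Fixed call, passing an empty list for the warnings accumulator.
print(translate_component("[core]", "abc123", []))
```

Because Python checks arity only when the function is actually called, bugs like this survive until the code path runs, which is why a static analyzer caught it before the release script did.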
[spark] branch master updated: [SPARK-26466][CORE] Use ConfigEntry for hardcoded configs for submit categories.
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 38f0307 [SPARK-26466][CORE] Use ConfigEntry for hardcoded configs for submit categories. 38f0307 is described below commit 38f030725c561979ca98b2a6cc7ca6c02a1f80ed Author: Jungtaek Lim (HeartSaVioR) AuthorDate: Wed Jan 16 20:57:21 2019 -0600 [SPARK-26466][CORE] Use ConfigEntry for hardcoded configs for submit categories. ## What changes were proposed in this pull request? The PR makes hardcoded configs below to use `ConfigEntry`. * spark.kryo * spark.kryoserializer * spark.serializer * spark.jars * spark.files * spark.submit * spark.deploy * spark.worker This patch doesn't change configs which are not relevant to SparkConf (e.g. system properties). ## How was this patch tested? Existing tests. Closes #23532 from HeartSaVioR/SPARK-26466-v2. Authored-by: Jungtaek Lim (HeartSaVioR) Signed-off-by: Sean Owen --- .../main/scala/org/apache/spark/SparkConf.scala| 21 +++ .../main/scala/org/apache/spark/SparkContext.scala | 4 +- .../src/main/scala/org/apache/spark/SparkEnv.scala | 9 ++- .../main/scala/org/apache/spark/api/r/RUtils.scala | 3 +- .../apache/spark/deploy/FaultToleranceTest.scala | 6 +- .../org/apache/spark/deploy/SparkCuratorUtil.scala | 3 +- .../org/apache/spark/deploy/SparkSubmit.scala | 29 + .../org/apache/spark/deploy/master/Master.scala| 48 +++ .../spark/deploy/master/RecoveryModeFactory.scala | 7 ++- .../master/ZooKeeperLeaderElectionAgent.scala | 5 +- .../deploy/master/ZooKeeperPersistenceEngine.scala | 15 ++--- .../apache/spark/deploy/worker/DriverRunner.scala | 6 +- .../org/apache/spark/deploy/worker/Worker.scala| 20 +++ .../spark/deploy/worker/WorkerArguments.scala | 5 +- .../spark/deploy/worker/ui/WorkerWebUI.scala | 2 - .../org/apache/spark/internal/config/Deploy.scala | 68 ++ 
.../org/apache/spark/internal/config/Kryo.scala| 57 ++ .../org/apache/spark/internal/config/Worker.scala | 63 .../org/apache/spark/internal/config/package.scala | 31 ++ .../apache/spark/serializer/JavaSerializer.scala | 5 +- .../apache/spark/serializer/KryoSerializer.scala | 29 - .../main/scala/org/apache/spark/util/Utils.scala | 12 ++-- .../org/apache/spark/JobCancellationSuite.scala| 3 +- .../test/scala/org/apache/spark/ShuffleSuite.scala | 3 +- .../scala/org/apache/spark/SparkConfSuite.scala| 31 +- .../spark/api/python/PythonBroadcastSuite.scala| 3 +- .../apache/spark/broadcast/BroadcastSuite.scala| 3 +- .../org/apache/spark/deploy/SparkSubmitSuite.scala | 34 +-- .../apache/spark/deploy/master/MasterSuite.scala | 14 ++--- .../deploy/master/PersistenceEngineSuite.scala | 3 +- .../deploy/rest/SubmitRestProtocolSuite.scala | 3 +- .../apache/spark/deploy/worker/WorkerSuite.scala | 9 +-- .../apache/spark/scheduler/MapStatusSuite.scala| 2 +- .../serializer/GenericAvroSerializerSuite.scala| 3 +- .../apache/spark/serializer/KryoBenchmark.scala| 8 ++- .../spark/serializer/KryoSerializerBenchmark.scala | 8 ++- .../KryoSerializerDistributedSuite.scala | 4 +- .../KryoSerializerResizableOutputSuite.scala | 14 +++-- .../spark/serializer/KryoSerializerSuite.scala | 44 +++--- .../serializer/SerializerPropertiesSuite.scala | 3 +- .../serializer/UnsafeKryoSerializerSuite.scala | 6 +- .../spark/storage/FlatmapIteratorSuite.scala | 4 +- .../scala/org/apache/spark/util/UtilsSuite.scala | 3 +- .../collection/ExternalAppendOnlyMapSuite.scala| 4 +- .../util/collection/ExternalSorterSuite.scala | 7 ++- .../apache/spark/ml/attribute/AttributeSuite.scala | 3 +- .../apache/spark/ml/feature/InstanceSuite.scala| 3 +- .../spark/ml/feature/LabeledPointSuite.scala | 3 +- .../apache/spark/ml/tree/impl/TreePointSuite.scala | 3 +- .../spark/mllib/clustering/KMeansSuite.scala | 3 +- .../apache/spark/mllib/feature/Word2VecSuite.scala | 19 -- .../apache/spark/mllib/linalg/MatricesSuite.scala | 3 +- 
.../apache/spark/mllib/linalg/VectorsSuite.scala | 3 +- .../spark/mllib/regression/LabeledPointSuite.scala | 3 +- .../distribution/MultivariateGaussianSuite.scala | 3 +- .../k8s/features/BasicDriverFeatureStep.scala | 11 ++-- .../k8s/features/DriverCommandFeatureStep.scala| 13 +++-- .../k8s/features/BasicDriverFeatureStepSuite.scala | 6 +- .../integrationtest/KubernetesTestComponents.scala | 3 +- .../deploy/mesos/MesosClusterDispatcher.scala | 1 + .../org/apache/spark/deploy/mesos/config.scala | 12
[spark] branch master updated: [SPARK-26600] Update spark-submit usage message
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 272428d [SPARK-26600] Update spark-submit usage message 272428d is described below commit 272428db6fabf43ceaa7d0ee9297916e5f9b5c24 Author: Luca Canali AuthorDate: Wed Jan 16 20:55:28 2019 -0600 [SPARK-26600] Update spark-submit usage message ## What changes were proposed in this pull request? Spark-submit usage message should be put in sync with recent changes in particular regarding K8S support. These are the proposed changes to the usage message: --executor-cores NUM -> can be useed for Spark on YARN and K8S --principal PRINCIPAL and --keytab KEYTAB -> can be used for Spark on YARN and K8S --total-executor-cores NUM> can be used for Spark standalone, YARN and K8S In addition this PR proposes to remove certain implementation details from the --keytab argument description as the implementation details vary between YARN and K8S, for example. ## How was this patch tested? Manually tested Closes #23518 from LucaCanali/updateSparkSubmitArguments. Authored-by: Luca Canali Signed-off-by: Sean Owen --- .../apache/spark/deploy/SparkSubmitArguments.scala | 23 +++--- 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala index 34facd5..f5e4c4a 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala @@ -576,27 +576,26 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S | --kill SUBMISSION_IDIf given, kills the driver specified. | --status SUBMISSION_ID If given, requests the status of the driver specified. 
| -| Spark standalone and Mesos only: +| Spark standalone, Mesos and Kubernetes only: | --total-executor-cores NUM Total cores for all executors. | -| Spark standalone and YARN only: -| --executor-cores NUMNumber of cores per executor. (Default: 1 in YARN mode, -| or all available cores on the worker in standalone mode) +| Spark standalone, YARN and Kubernetes only: +| --executor-cores NUMNumber of cores used by each executor. (Default: 1 in +| YARN and K8S modes, or all available cores on the worker +| in standalone mode). | -| YARN-only: +| Spark on YARN and Kubernetes only: +| --principal PRINCIPAL Principal to be used to login to KDC. +| --keytab KEYTAB The full path to the file that contains the keytab for the +| principal specified above. +| +| Spark on YARN only: | --queue QUEUE_NAME The YARN queue to submit to (Default: "default"). | --num-executors NUM Number of executors to launch (Default: 2). | If dynamic allocation is enabled, the initial number of | executors will be at least NUM. | --archives ARCHIVES Comma separated list of archives to be extracted into the | working directory of each executor. -| --principal PRINCIPAL Principal to be used to login to KDC, while running on -| secure HDFS. -| --keytab KEYTAB The full path to the file that contains the keytab for the -| principal specified above. This keytab will be copied to -| the node running the Application Master via the Secure -| Distributed Cache, for renewing the login tickets and the -| delegation tokens periodically. """.stripMargin ) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r32000 - in /dev/spark/2.4.1-SNAPSHOT-2019_01_16_17_46-d608325-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu Jan 17 02:01:07 2019 New Revision: 32000 Log: Apache Spark 2.4.1-SNAPSHOT-2019_01_16_17_46-d608325 docs [This commit notification would consist of 1476 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
svn commit: r31996 - in /dev/spark/3.0.0-SNAPSHOT-2019_01_16_15_24-dc3b35c-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed Jan 16 23:36:36 2019 New Revision: 31996 Log: Apache Spark 3.0.0-SNAPSHOT-2019_01_16_15_24-dc3b35c docs [This commit notification would consist of 1777 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
[spark] branch branch-2.4 updated: [SPARK-26633][REPL] Add ExecutorClassLoader.getResourceAsStream
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new d608325 [SPARK-26633][REPL] Add ExecutorClassLoader.getResourceAsStream d608325 is described below commit d608325a9fbefbe3a05e2bc8f6a95b2fa9c4174b Author: Kris Mok AuthorDate: Wed Jan 16 15:21:11 2019 -0800 [SPARK-26633][REPL] Add ExecutorClassLoader.getResourceAsStream ## What changes were proposed in this pull request? Add `ExecutorClassLoader.getResourceAsStream`, so that classes dynamically generated by the REPL can be accessed by user code as `InputStream`s for non-class-loading purposes, such as reading the class file for extracting method/constructor parameter names. Caveat: The convention in Java's `ClassLoader` is that `ClassLoader.getResourceAsStream()` should be considered as a convenience method of `ClassLoader.getResource()`, where the latter provides a `URL` for the resource, and the former invokes `openStream()` on it to serve the resource as an `InputStream`. The former should also catch `IOException` from `openStream()` and convert it to `null`. This PR breaks this convention by only overriding `ClassLoader.getResourceAsStream()` instead of also overriding `ClassLoader.getResource()`, so after this PR, it would be possible to get a non-null result from the former, but get a null result from the latter. This isn't ideal, but it's sufficient to cover the main use case and practically it shouldn't matter. To implement the convention properly, we'd need to register a URL protocol handler with Java to allow it to properly handle the `spark://` protocol, etc, which sounds like an overkill for the intent of this PR. Credit goes to zsxwing for the initial investigation and fix suggestion. ## How was this patch tested? Added new test case in `ExecutorClassLoaderSuite` and `ReplSuite`. 
Closes #23558 from rednaxelafx/executorclassloader-getresourceasstream. Authored-by: Kris Mok Signed-off-by: gatorsmile (cherry picked from commit dc3b35c5da42def803dd05e2db7506714018e27b) Signed-off-by: gatorsmile --- .../apache/spark/repl/ExecutorClassLoader.scala| 31 +++-- .../spark/repl/ExecutorClassLoaderSuite.scala | 11 .../scala/org/apache/spark/repl/ReplSuite.scala| 32 ++ 3 files changed, 72 insertions(+), 2 deletions(-) diff --git a/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala b/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala index 88eb0ad..a4a11f0 100644 --- a/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala +++ b/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala @@ -33,8 +33,11 @@ import org.apache.spark.util.ParentClassLoader /** * A ClassLoader that reads classes from a Hadoop FileSystem or Spark RPC endpoint, used to load * classes defined by the interpreter when the REPL is used. Allows the user to specify if user - * class path should be first. This class loader delegates getting/finding resources to parent - * loader, which makes sense until REPL never provide resource dynamically. + * class path should be first. + * This class loader delegates getting/finding resources to parent loader, which makes sense because + * the REPL never produce resources dynamically. One exception is when getting a Class file as + * resource stream, in which case we will try to fetch the Class file in the same way as loading + * the class, so that dynamically generated Classes from the REPL can be picked up. * * Note: [[ClassLoader]] will preferentially load class from parent. Only when parent is null or * the load failed, that it will call the overridden `findClass` function. 
To avoid the potential @@ -71,6 +74,30 @@ class ExecutorClassLoader( parentLoader.getResources(name) } + override def getResourceAsStream(name: String): InputStream = { +if (userClassPathFirst) { + val res = getClassResourceAsStreamLocally(name) + if (res != null) res else parentLoader.getResourceAsStream(name) +} else { + val res = parentLoader.getResourceAsStream(name) + if (res != null) res else getClassResourceAsStreamLocally(name) +} + } + + private def getClassResourceAsStreamLocally(name: String): InputStream = { +// Class files can be dynamically generated from the REPL. Allow this class loader to +// load such files for purposes other than loading the class. +try { + if (name.endsWith(".class")) fetchFn(name) else null +} catch { + // The helper functions referenced by fetchFn throw CNFE to indicate failure to fetch + // the class. It matches what IOException was supposed to be used for, and + //
[spark] branch master updated: [SPARK-26633][REPL] Add ExecutorClassLoader.getResourceAsStream
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new dc3b35c [SPARK-26633][REPL] Add ExecutorClassLoader.getResourceAsStream dc3b35c is described below commit dc3b35c5da42def803dd05e2db7506714018e27b Author: Kris Mok AuthorDate: Wed Jan 16 15:21:11 2019 -0800 [SPARK-26633][REPL] Add ExecutorClassLoader.getResourceAsStream ## What changes were proposed in this pull request? Add `ExecutorClassLoader.getResourceAsStream`, so that classes dynamically generated by the REPL can be accessed by user code as `InputStream`s for non-class-loading purposes, such as reading the class file for extracting method/constructor parameter names. Caveat: The convention in Java's `ClassLoader` is that `ClassLoader.getResourceAsStream()` should be considered as a convenience method of `ClassLoader.getResource()`, where the latter provides a `URL` for the resource, and the former invokes `openStream()` on it to serve the resource as an `InputStream`. The former should also catch `IOException` from `openStream()` and convert it to `null`. This PR breaks this convention by only overriding `ClassLoader.getResourceAsStream()` instead of also overriding `ClassLoader.getResource()`, so after this PR, it would be possible to get a non-null result from the former, but get a null result from the latter. This isn't ideal, but it's sufficient to cover the main use case and practically it shouldn't matter. To implement the convention properly, we'd need to register a URL protocol handler with Java to allow it to properly handle the `spark://` protocol, etc, which sounds like an overkill for the intent of this PR. Credit goes to zsxwing for the initial investigation and fix suggestion. ## How was this patch tested? Added new test case in `ExecutorClassLoaderSuite` and `ReplSuite`. 
Closes #23558 from rednaxelafx/executorclassloader-getresourceasstream. Authored-by: Kris Mok Signed-off-by: gatorsmile --- .../apache/spark/repl/ExecutorClassLoader.scala| 31 +++-- .../spark/repl/ExecutorClassLoaderSuite.scala | 11 .../scala/org/apache/spark/repl/ReplSuite.scala| 32 ++ 3 files changed, 72 insertions(+), 2 deletions(-) diff --git a/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala b/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala index 3176502..177bce2 100644 --- a/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala +++ b/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala @@ -33,8 +33,11 @@ import org.apache.spark.util.ParentClassLoader /** * A ClassLoader that reads classes from a Hadoop FileSystem or Spark RPC endpoint, used to load * classes defined by the interpreter when the REPL is used. Allows the user to specify if user - * class path should be first. This class loader delegates getting/finding resources to parent - * loader, which makes sense until REPL never provide resource dynamically. + * class path should be first. + * This class loader delegates getting/finding resources to parent loader, which makes sense because + * the REPL never produce resources dynamically. One exception is when getting a Class file as + * resource stream, in which case we will try to fetch the Class file in the same way as loading + * the class, so that dynamically generated Classes from the REPL can be picked up. * * Note: [[ClassLoader]] will preferentially load class from parent. Only when parent is null or * the load failed, that it will call the overridden `findClass` function. 
To avoid the potential @@ -71,6 +74,30 @@ class ExecutorClassLoader( parentLoader.getResources(name) } + override def getResourceAsStream(name: String): InputStream = { +if (userClassPathFirst) { + val res = getClassResourceAsStreamLocally(name) + if (res != null) res else parentLoader.getResourceAsStream(name) +} else { + val res = parentLoader.getResourceAsStream(name) + if (res != null) res else getClassResourceAsStreamLocally(name) +} + } + + private def getClassResourceAsStreamLocally(name: String): InputStream = { +// Class files can be dynamically generated from the REPL. Allow this class loader to +// load such files for purposes other than loading the class. +try { + if (name.endsWith(".class")) fetchFn(name) else null +} catch { + // The helper functions referenced by fetchFn throw CNFE to indicate failure to fetch + // the class. It matches what IOException was supposed to be used for, and + // ClassLoader.getResourceAsStream() catches IOException and returns null in that case. + // So we follow that model and handle CNFE
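The lookup order in the new `getResourceAsStream` override — try one source, fall back to the other depending on `userClassPathFirst` — can be mirrored in a short Python sketch. The loader functions here are hypothetical stand-ins for the parent class loader and the REPL-local class fetch:

```python
def get_resource_as_stream(name, user_class_path_first, local_fetch, parent_fetch):
    """Return the first non-None result. The REPL-local fetch is tried before
    or after the parent loader depending on user_class_path_first, mirroring
    the two branches of ExecutorClassLoader.getResourceAsStream."""
    if user_class_path_first:
        first, second = local_fetch, parent_fetch
    else:
        first, second = parent_fetch, local_fetch
    res = first(name)
    return res if res is not None else second(name)

def local(name):
    # Only .class files are served locally, as in getClassResourceAsStreamLocally.
    return b"local-bytes" if name.endswith(".class") else None

def parent(name):
    # Pretend the parent loader only knows about one static resource.
    return b"parent-bytes" if name == "app.conf" else None
```

Either way, a dynamically generated `.class` file is found: when the parent loader returns nothing, the local fetch is consulted, which is what lets REPL-generated classes be read as streams.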
svn commit: r31994 - in /dev/spark/2.4.1-SNAPSHOT-2019_01_16_12_58-1843c16-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed Jan 16 21:14:11 2019 New Revision: 31994 Log: Apache Spark 2.4.1-SNAPSHOT-2019_01_16_12_58-1843c16 docs [This commit notification would consist of 1476 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
svn commit: r31993 - in /dev/spark/2.3.4-SNAPSHOT-2019_01_16_12_57-c0fc6d0-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed Jan 16 21:12:23 2019 New Revision: 31993 Log: Apache Spark 2.3.4-SNAPSHOT-2019_01_16_12_57-c0fc6d0 docs [This commit notification would consist of 1443 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
[spark] branch master updated: [SPARK-26625] Add oauthToken to spark.redaction.regex
This is an automated email from the ASF dual-hosted git repository. mcheah pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 01301d0 [SPARK-26625] Add oauthToken to spark.redaction.regex 01301d0 is described below commit 01301d09721cc12f1cc66ab52de3da117f5d33e6 Author: Vinoo Ganesh AuthorDate: Wed Jan 16 11:43:10 2019 -0800 [SPARK-26625] Add oauthToken to spark.redaction.regex ## What changes were proposed in this pull request? The regex (spark.redaction.regex) that is used to decide which config properties or environment settings are sensitive should also include oauthToken to match spark.kubernetes.authenticate.submission.oauthToken ## How was this patch tested? Simple regex addition - happy to add a test if needed. Author: Vinoo Ganesh Closes #23555 from vinooganesh/vinooganesh/SPARK-26625. --- core/src/main/scala/org/apache/spark/internal/config/package.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala b/core/src/main/scala/org/apache/spark/internal/config/package.scala index 95becfa..0e78637 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/package.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala @@ -512,7 +512,7 @@ package object config { "a property key or value, the value is redacted from the environment UI and various logs " + "like YARN and event logs.") .regexConf - .createWithDefault("(?i)secret|password".r) + .createWithDefault("(?i)secret|password|token".r) private[spark] val STRING_REDACTION_PATTERN = ConfigBuilder("spark.redaction.string.regex") - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
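The one-word change above widens the default pattern from `(?i)secret|password` to `(?i)secret|password|token`, so keys like `spark.kubernetes.authenticate.submission.oauthToken` are caught. A sketch of the effect using plain `re` — this redacts by key only, a simplification of Spark's actual redaction, which also inspects values:

```python
import re

# The new default value of spark.redaction.regex.
REDACTION_PATTERN = re.compile(r"(?i)secret|password|token")

def redact(conf):
    """Replace the value of any config entry whose key matches the pattern."""
    return {k: ("*********(redacted)" if REDACTION_PATTERN.search(k) else v)
            for k, v in conf.items()}

conf = {
    "spark.kubernetes.authenticate.submission.oauthToken": "abc",
    "spark.hadoop.fs.s3a.secret.key": "xyz",
    "spark.app.name": "demo",
}
redacted = redact(conf)
```

Note the `(?i)` flag: `oauthToken` matches `token` only because the search is case-insensitive, which is why the pattern needs no extra alternation for capitalized variants.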
[spark] branch master updated: [SPARK-26619][SQL] Prune the unused serializers from SerializeFromObject
This is an automated email from the ASF dual-hosted git repository. dbtsai pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8f17078 [SPARK-26619][SQL] Prune the unused serializers from SerializeFromObject 8f17078 is described below commit 8f170787d24912c3a94ce1510197820c87df472a Author: Liang-Chi Hsieh AuthorDate: Wed Jan 16 19:16:37 2019 + [SPARK-26619][SQL] Prune the unused serializers from SerializeFromObject ## What changes were proposed in this pull request? `SerializeFromObject` now keeps all serializer expressions for domain object even when only part of output attributes are used by top plan. We should be able to prune unused serializers from `SerializeFromObject` in such case. ## How was this patch tested? Added tests. Closes #23562 from viirya/SPARK-26619. Authored-by: Liang-Chi Hsieh Signed-off-by: DB Tsai --- .../org/apache/spark/sql/catalyst/optimizer/Optimizer.scala | 5 + .../spark/sql/catalyst/optimizer/ColumnPruningSuite.scala | 9 + .../src/test/scala/org/apache/spark/sql/DatasetSuite.scala| 11 +++ 3 files changed, 25 insertions(+) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index d92f7f8..20f1221 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -561,6 +561,11 @@ object ColumnPruning extends Rule[LogicalPlan] { case d @ DeserializeToObject(_, _, child) if !child.outputSet.subsetOf(d.references) => d.copy(child = prunedChild(child, d.references)) +case p @ Project(_, s: SerializeFromObject) if p.references != s.outputSet => + val usedRefs = p.references + val prunedSerializer = s.serializer.filter(usedRefs.contains) + p.copy(child = 
SerializeFromObject(prunedSerializer, s.child)) + // Prunes the unused columns from child of Aggregate/Expand/Generate/ScriptTransformation case a @ Aggregate(_, _, child) if !child.outputSet.subsetOf(a.references) => a.copy(child = prunedChild(child, a.references)) diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala index 0cd6e09..73112e3 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala @@ -400,5 +400,14 @@ class ColumnPruningSuite extends PlanTest { comparePlans(optimized, expected) } + test("SPARK-26619: Prune the unused serializers from SerializeFromObject") { +val testRelation = LocalRelation('_1.int, '_2.int) +val serializerObject = CatalystSerde.serialize[(Int, Int)]( + CatalystSerde.deserialize[(Int, Int)](testRelation)) +val query = serializerObject.select('_1) +val optimized = Optimize.execute(query.analyze) +val expected = serializerObject.copy(serializer = Seq(serializerObject.serializer.head)).analyze +comparePlans(optimized, expected) + } // todo: add more tests for column pruning } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala index c90b158..fb8239e 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala @@ -24,6 +24,7 @@ import org.apache.spark.SparkException import org.apache.spark.sql.catalyst.ScroogeLikeExample import org.apache.spark.sql.catalyst.encoders.{OuterScopes, RowEncoder} import org.apache.spark.sql.catalyst.plans.{LeftAnti, LeftSemi} +import org.apache.spark.sql.catalyst.plans.logical.SerializeFromObject import 
org.apache.spark.sql.catalyst.util.sideBySide import org.apache.spark.sql.execution.{LogicalRDD, RDDScanExec} import org.apache.spark.sql.execution.exchange.{BroadcastExchangeExec, ShuffleExchangeExec} @@ -1667,6 +1668,16 @@ class DatasetSuite extends QueryTest with SharedSQLContext { val exceptDF = inputDF.filter(col("a").isin("0") or col("b") > "c") checkAnswer(inputDF.except(exceptDF), Seq(Row("1", null))) } + + test("SPARK-26619: Prune the unused serializers from SerializeFromObjec") { +val data = Seq(("a", 1), ("b", 2), ("c", 3)) +val ds = data.toDS().map(t => (t._1, t._2 + 1)).select("_1") +val serializer = ds.queryExecution.optimizedPlan.collect { + case s: SerializeFromObject => s +}.head +assert(serializer.serializer.size == 1) +checkAnswer(ds, Seq(Row("a"),
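The new `ColumnPruning` case fires only when the parent `Project` references a strict subset of `SerializeFromObject`'s output (`p.references != s.outputSet`), then filters the serializer list down to the referenced attributes. Stripped of Catalyst's expression machinery, the rule reduces to a guarded list filter — a toy Python model with hypothetical attribute names:

```python
def prune_serializers(project_refs, serialize_output, serializers):
    """Mirror the optimizer case: if the Project uses every output attribute,
    leave the serializers alone; otherwise keep only the referenced ones,
    as in s.serializer.filter(usedRefs.contains)."""
    if set(project_refs) == set(serialize_output):
        return serializers  # nothing to prune
    return [s for s in serializers if s["attr"] in project_refs]

serializers = [
    {"attr": "_1", "expr": "serialize(obj._1)"},
    {"attr": "_2", "expr": "serialize(obj._2)"},
]

# .select("_1") references only _1, so the _2 serializer is dropped.
pruned = prune_serializers({"_1"}, {"_1", "_2"}, serializers)
```

This matches the `DatasetSuite` test in the diff: after `map(...).select("_1")`, the optimized plan's `SerializeFromObject` carries a single serializer expression.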
svn commit: r31992 - in /dev/spark/3.0.0-SNAPSHOT-2019_01_16_10_30-190814e-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed Jan 16 18:42:49 2019 New Revision: 31992 Log: Apache Spark 3.0.0-SNAPSHOT-2019_01_16_10_30-190814e docs [This commit notification would consist of 1777 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
[spark] branch branch-2.3 updated: Revert "[SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream"
This is an automated email from the ASF dual-hosted git repository. zsxwing pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.3 by this push: new c0fc6d0 Revert "[SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream" c0fc6d0 is described below commit c0fc6d0d8dbd890a817176eb1da6e98252c2e0c0 Author: Shixiong Zhu AuthorDate: Wed Jan 16 10:03:21 2019 -0800 Revert "[SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream" This reverts commit 5a50ae37f4c41099c174459603966ee25f21ac75. --- .../execution/streaming/FileStreamSourceLog.scala | 4 +- .../sql/execution/streaming/HDFSMetadataLog.scala | 3 +- .../execution/streaming/HDFSMetadataLogSuite.scala | 6 -- .../sql/streaming/FileStreamSourceSuite.scala | 75 ++ 4 files changed, 8 insertions(+), 80 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala index 7b2ea96..8628471 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala @@ -117,9 +117,7 @@ class FileStreamSourceLog( val batches = (existedBatches ++ retrievedBatches).map(i => i._1 -> i._2.get).toArray.sortBy(_._1) -if (startBatchId <= endBatchId) { - HDFSMetadataLog.verifyBatchIds(batches.map(_._1), startId, endId) -} +HDFSMetadataLog.verifyBatchIds(batches.map(_._1), startId, endId) batches } } diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala index d4cfbb3..00bc215 100644 --- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala @@ -457,8 +457,7 @@ object HDFSMetadataLog { } /** - * Verify if batchIds are continuous and between `startId` and `endId` (both inclusive and - * startId assumed to be <= endId). + * Verify if batchIds are continuous and between `startId` and `endId`. * * @param batchIds the sorted ids to verify. * @param startId the start id. If it's set, batchIds should start with this id. diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala index 57a0343..4677769 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala @@ -275,12 +275,6 @@ class HDFSMetadataLogSuite extends SparkFunSuite with SharedSQLContext { intercept[IllegalStateException](verifyBatchIds(Seq(2, 3, 4), None, Some(5L))) intercept[IllegalStateException](verifyBatchIds(Seq(2, 3, 4), Some(1L), Some(5L))) intercept[IllegalStateException](verifyBatchIds(Seq(1, 2, 4, 5), Some(1L), Some(5L))) - -// Related to SPARK-26629, this capatures the behavior for verifyBatchIds when startId > endId -intercept[IllegalStateException](verifyBatchIds(Seq(), Some(2L), Some(1L))) -intercept[AssertionError](verifyBatchIds(Seq(2), Some(2L), Some(1L))) -intercept[AssertionError](verifyBatchIds(Seq(1), Some(2L), Some(1L))) -intercept[AssertionError](verifyBatchIds(Seq(0), Some(2L), Some(1L))) } } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala index fb0b365..d4bd9c7 100644 --- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala @@ -48,33 +48,21 @@ abstract class FileStreamSourceTest * `FileStreamSource` actually being used in the execution. */ abstract class AddFileData extends AddData { -private val _qualifiedBasePath = PrivateMethod[Path]('qualifiedBasePath) - -private def isSamePath(fileSource: FileStreamSource, srcPath: File): Boolean = { - val path = (fileSource invokePrivate _qualifiedBasePath()).toString.stripPrefix("file:") - path == srcPath.getCanonicalPath -} - override def addData(query: Option[StreamExecution]): (Source, Offset) = { require( query.nonEmpty, "Cannot add data when there is no query for finding the active file stream source")
[spark] branch master updated: [SPARK-26550][SQL] New built-in datasource - noop
This is an automated email from the ASF dual-hosted git repository. hvanhovell pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 190814e [SPARK-26550][SQL] New built-in datasource - noop 190814e is described below commit 190814e82eca3da450683685798d470712560a5d Author: Maxim Gekk AuthorDate: Wed Jan 16 19:01:58 2019 +0100 [SPARK-26550][SQL] New built-in datasource - noop ## What changes were proposed in this pull request? In the PR, I propose a new built-in datasource with the name `noop`, which can be used in: - benchmarking, to avoid the additional overhead of actions and unnecessary type conversions - caching of datasets/dataframes - producing other side effects as a consequence of row materialisation, like uploading data to IO caches. ## How was this patch tested? Added a test to check that datasource rows are materialised. Closes #23471 from MaxGekk/none-datasource. Lead-authored-by: Maxim Gekk Co-authored-by: Maxim Gekk Signed-off-by: Herman van Hovell --- ...org.apache.spark.sql.sources.DataSourceRegister | 1 + .../datasources/noop/NoopDataSource.scala | 66 ++ .../sql/execution/datasources/noop/NoopSuite.scala | 62 3 files changed, 129 insertions(+) diff --git a/sql/core/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister b/sql/core/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister index 1b37905..7cdfddc 100644 --- a/sql/core/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister +++ b/sql/core/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister @@ -1,6 +1,7 @@ org.apache.spark.sql.execution.datasources.csv.CSVFileFormat org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider org.apache.spark.sql.execution.datasources.json.JsonFileFormat +org.apache.spark.sql.execution.datasources.noop.NoopDataSource 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat org.apache.spark.sql.execution.datasources.text.TextFileFormat diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/noop/NoopDataSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/noop/NoopDataSource.scala new file mode 100644 index 000..79e4c62 --- /dev/null +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/noop/NoopDataSource.scala @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.noop + +import org.apache.spark.sql.SaveMode +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.sources.DataSourceRegister +import org.apache.spark.sql.sources.v2._ +import org.apache.spark.sql.sources.v2.writer._ +import org.apache.spark.sql.types.StructType + +/** + * This is no-op datasource. It does not do anything besides consuming its input. + * This can be useful for benchmarking or to cache data without any additional overhead. 
+ */ +class NoopDataSource + extends DataSourceV2 + with TableProvider + with DataSourceRegister { + + override def shortName(): String = "noop" + override def getTable(options: DataSourceOptions): Table = NoopTable +} + +private[noop] object NoopTable extends Table with SupportsBatchWrite { + override def newWriteBuilder(options: DataSourceOptions): WriteBuilder = NoopWriteBuilder + override def name(): String = "noop-table" + override def schema(): StructType = new StructType() +} + +private[noop] object NoopWriteBuilder extends WriteBuilder with SupportsSaveMode { + override def buildForBatch(): BatchWrite = NoopBatchWrite + override def mode(mode: SaveMode): WriteBuilder = this +} + +private[noop] object NoopBatchWrite extends BatchWrite { + override def createBatchWriterFactory(): DataWriterFactory = NoopWriterFactory + override def commit(messages: Array[WriterCommitMessage]): Unit = {} + override
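For context, the contract of this sink can be sketched in a few lines of plain Python (a hypothetical model, not Spark's API): a no-op writer consumes every row, forcing materialization, which is the point for benchmarking, and then discards it. In user code the sink would be selected with something like `df.write.format("noop").mode("overwrite").save()` in Spark versions that ship this source.

```python
# Toy sketch of the no-op writer's contract (not Spark's DataWriter API):
# rows are fully consumed -- so upstream work really executes -- but nothing
# is stored, and commit produces no result.

class NoopDataWriter:
    def __init__(self):
        self.rows_seen = 0  # visible side effect: rows were materialized

    def write(self, row):
        self.rows_seen += 1  # consume and discard

    def commit(self):
        return None  # nothing to commit

writer = NoopDataWriter()
for row in [("a", 1), ("b", 2), ("c", 3)]:
    writer.write(row)
writer.commit()
assert writer.rows_seen == 3
```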
[spark] branch branch-2.3 updated: [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream
This is an automated email from the ASF dual-hosted git repository. zsxwing pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.3 by this push: new 5a50ae3 [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream 5a50ae3 is described below commit 5a50ae37f4c41099c174459603966ee25f21ac75 Author: Tathagata Das AuthorDate: Wed Jan 16 09:42:14 2019 -0800 [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream ## What changes were proposed in this pull request? When a streaming query has multiple file streams, and there is a batch where one of the file streams doesn't have data in that batch, then if the query has to restart from that batch, it will throw the following error. ``` java.lang.IllegalStateException: batch 1 doesn't exist at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$.verifyBatchIds(HDFSMetadataLog.scala:300) at org.apache.spark.sql.execution.streaming.FileStreamSourceLog.get(FileStreamSourceLog.scala:120) at org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:181) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets$2.apply(MicroBatchExecution.scala:294) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets$2.apply(MicroBatchExecution.scala:291) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:291) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:178) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:175) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:169) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:205) ``` The existing `HDFSMetadata.verifyBatchIds` threw an error whenever the `batchIds` list was empty. In the context of `FileStreamSource.getBatch` (where verify is called) and `FileStreamSourceLog` (subclass of `HDFSMetadata`), this is usually okay because, in a streaming query with one file stream, `batchIds` can never be empty: - A batch is planned only when the `FileStreamSourceLog` has seen a new offset (that is, there are new data files). - So `FileStreamSource.getBatch` will be called on X to Y where X will always be < Y. 
This internally calls `HDFSMetadata.verifyBatchIds(X+1, Y)` with Y-X ids. For example, `FileStreamSource.getBatch(4, 5)` will call `verify(batchIds = Seq(5), start = 5, end = 5)`. However, the invariant of X < Y does not hold when there are two file stream sources, as a batch may be planned even when only one of the file streams has data. So one of the file streams may not have data, which can lead to `FileStreamSource.getBatch(X, X)` -> `verify(batchIds = Seq.empty, start = X+1, end = X)` -> failure. Note that `FileStreamSource.getBatch(X, X)` gets called **only when restarting a query in a batch where a file source did not have data**. This is because in normal planning of batches, `MicroBatchExecution` avoids calling `FileStreamSource.getBatch(X, X)` when offset X
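The verification contract described above can be sketched in plain Python (a simplified model of `HDFSMetadataLog.verifyBatchIds`, not the actual Scala implementation):

```python
# Simplified model of verifyBatchIds: the ids must be a contiguous run from
# start_id to end_id. An empty list with start_id <= end_id raises -- which
# is exactly the SPARK-26629 failure mode on restart.

def verify_batch_ids(batch_ids, start_id=None, end_id=None):
    if start_id is not None and end_id is not None and not batch_ids:
        raise ValueError("batch {} doesn't exist".format(start_id))
    if batch_ids:
        if start_id is not None and batch_ids[0] != start_id:
            raise ValueError("batches before {} are missing".format(batch_ids[0]))
        if end_id is not None and batch_ids[-1] != end_id:
            raise ValueError("batches after {} are missing".format(batch_ids[-1]))
        for prev, cur in zip(batch_ids, batch_ids[1:]):
            if cur != prev + 1:
                raise ValueError("batch {} doesn't exist".format(prev + 1))

# Normal case: FileStreamSource.getBatch(4, 5) verifies ids 5..5 -- passes.
verify_batch_ids([5], start_id=5, end_id=5)

# SPARK-26629: a source with no data in the batch restarts via getBatch(X, X),
# i.e. verify([], X+1, X) -- which raises, hence the guard added by the fix.
try:
    verify_batch_ids([], start_id=6, end_id=5)
    raised = False
except ValueError:
    raised = True
assert raised
```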
[spark] branch branch-2.4 updated: [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream
This is an automated email from the ASF dual-hosted git repository. zsxwing pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 1843c16 [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream 1843c16 is described below commit 1843c16fda09a3e9373e8f7b3ff5f73455c50442 Author: Tathagata Das AuthorDate: Wed Jan 16 09:42:14 2019 -0800 [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream [Commit message identical to the branch-2.3 notification above.]
[spark] branch master updated: [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream
This is an automated email from the ASF dual-hosted git repository. zsxwing pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 06d5b17 [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream 06d5b17 is described below commit 06d5b173b687c23aa53e293ed6e12ec746393876 Author: Tathagata Das AuthorDate: Wed Jan 16 09:42:14 2019 -0800 [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream [Commit message identical to the branch-2.3 notification above.]
svn commit: r31991 - in /dev/spark/2.4.1-SNAPSHOT-2019_01_16_08_02-3337477-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed Jan 16 16:18:15 2019 New Revision: 31991 Log: Apache Spark 2.4.1-SNAPSHOT-2019_01_16_08_02-3337477 docs [This commit notification would consist of 1476 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
svn commit: r31990 - in /dev/spark/2.3.4-SNAPSHOT-2019_01_16_08_01-8319ba7-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed Jan 16 16:16:52 2019 New Revision: 31990 Log: Apache Spark 2.3.4-SNAPSHOT-2019_01_16_08_01-8319ba7 docs [This commit notification would consist of 1443 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
[spark] branch branch-2.4 updated: [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 3337477 [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing 3337477 is described below commit 3337477b759433f56d2a43be596196479f2b00de Author: Hyukjin Kwon AuthorDate: Wed Jan 16 23:25:57 2019 +0800 [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing This PR proposes to explicitly document that a SparkContext cannot be shared for multiprocessing, and that multi-processing execution is not guaranteed in PySpark. I have seen cases where users attempt to use multiple processes via the `multiprocessing` module from time to time. For instance, see the example in the JIRA (https://issues.apache.org/jira/browse/SPARK-25992). Py4J itself does not support Python's multiprocessing out of the box (sharing the same JavaGateway, for instance). In general, such a pattern can cause errors with somewhat arbitrary symptoms that are difficult to diagnose. For instance, see the error message in the JIRA: ``` Traceback (most recent call last): File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock self.process_request(request, client_address) File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 348, in process_request self.finish_request(request, client_address) File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 361, in finish_request self.RequestHandlerClass(request, client_address, self) File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 696, in __init__ self.handle() File "/usr/local/hadoop/spark2.3.1/python/pyspark/accumulators.py", line 238, in handle _accumulatorRegistry[aid] += update KeyError: 0 ``` The root cause was that the global `_accumulatorRegistry` is not shared across processes. 
Using threads instead of processes is quite easy in Python; see `threading` vs `multiprocessing` - they can usually be direct replacements for each other. For instance, Python also supports a thread pool (`multiprocessing.pool.ThreadPool`), which can be a direct replacement for the process-based pool (`multiprocessing.Pool`). Manually tested, and manually built the doc. Closes #23564 from HyukjinKwon/SPARK-25992. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit 670bc55f8d357a5cd894e290cc2834e952a7cfe0) Signed-off-by: Hyukjin Kwon --- python/pyspark/context.py | 4 1 file changed, 4 insertions(+) diff --git a/python/pyspark/context.py b/python/pyspark/context.py index 6d99e98..aff3635 100644 --- a/python/pyspark/context.py +++ b/python/pyspark/context.py @@ -63,6 +63,10 @@ class SparkContext(object): Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create L{RDD} and broadcast variables on that cluster. + +.. note:: :class:`SparkContext` instance is not supported to share across multiple +processes out of the box, and PySpark does not guarantee multi-processing execution. +Use threads instead for concurrent processing purpose. """ _gateway = None
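The drop-in replacement mentioned above looks like this in practice (plain stdlib Python, no Spark involved):

```python
# ThreadPool exposes the same API as multiprocessing.Pool but runs workers as
# threads in the current process, so shared state (like PySpark's Py4J gateway
# or the accumulator registry) stays visible to every worker.
from multiprocessing.pool import ThreadPool

def square(x):
    return x * x

with ThreadPool(4) as pool:
    results = pool.map(square, range(5))

assert results == [0, 1, 4, 9, 16]
```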
[spark] branch master updated: [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 670bc55 [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing 670bc55 is described below commit 670bc55f8d357a5cd894e290cc2834e952a7cfe0 Author: Hyukjin Kwon AuthorDate: Wed Jan 16 23:25:57 2019 +0800 [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing [Commit message identical to the branch-2.4 notification above.] ## How was this patch tested? Manually tested, and manually built the doc. Closes #23564 from HyukjinKwon/SPARK-25992. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- python/pyspark/context.py | 4 1 file changed, 4 insertions(+) diff --git a/python/pyspark/context.py b/python/pyspark/context.py index 316fbc8..e792300 100644 --- a/python/pyspark/context.py +++ b/python/pyspark/context.py @@ -66,6 +66,10 @@ class SparkContext(object): .. note:: Only one :class:`SparkContext` should be active per JVM. You must `stop()` the active :class:`SparkContext` before creating a new one. + +.. note:: :class:`SparkContext` instance is not supported to share across multiple +processes out of the box, and PySpark does not guarantee multi-processing execution. 
+Use threads instead for concurrent processing purpose. """ _gateway = None - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
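The thread-based replacement the commit message above describes can be sketched as follows. This is a plain-Python illustration of swapping `multiprocessing.Pool` for `multiprocessing.pool.ThreadPool`; the `square` worker and the inputs are hypothetical, not part of the commit:

```python
from multiprocessing.pool import ThreadPool  # thread-based, same API as multiprocessing.Pool

def square(x):
    # In real PySpark code, a thread-based worker could safely reference the one
    # shared SparkContext, because threads share the driver process's state
    # (including Py4J's gateway and the global accumulator registry), whereas
    # forked processes get their own copies and break with errors like the
    # KeyError shown in the traceback above.
    return x * x

# ThreadPool is a drop-in replacement for the process-based Pool:
# same map/apply interface, but no process boundary is crossed.
with ThreadPool(4) as pool:
    results = pool.map(square, range(5))

print(results)  # → [0, 1, 4, 9, 16]
```

The only change from a process-based version is the import: `multiprocessing.Pool(4)` would spawn separate processes, which is exactly the pattern the new docstring note warns against.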
[spark] branch branch-2.4 updated: [MINOR][PYTHON] Fix SQLContext to SparkSession in Python API main page
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new e52acc2 [MINOR][PYTHON] Fix SQLContext to SparkSession in Python API main page e52acc2 is described below commit e52acc2afcba8662b337b42e44a23ef118deea0f Author: Hyukjin Kwon AuthorDate: Wed Jan 16 23:23:36 2019 +0800 [MINOR][PYTHON] Fix SQLContext to SparkSession in Python API main page ## What changes were proposed in this pull request? This PR proposes to fix the deprecated `SQLContext` to `SparkSession` in the Python API main page. **Before:** ![screen shot 2019-01-16 at 5 30 19 pm](https://user-images.githubusercontent.com/6477701/51239583-bac82f80-19b4-11e9-9129-8dae2c23ec79.png) **After:** ![screen shot 2019-01-16 at 5 29 54 pm](https://user-images.githubusercontent.com/6477701/51239577-b734a880-19b4-11e9-8539-592cb772168d.png) ## How was this patch tested? Manually checked the doc after building it. I also checked by `grep -r "SQLContext"`, and it looks like this is the only instance left. Closes #23565 from HyukjinKwon/minor-doc-change. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit e92088de4d6755f975eb8b44b4d75b81e5a0720e) Signed-off-by: Hyukjin Kwon --- python/docs/index.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/docs/index.rst b/python/docs/index.rst index 421c8de..0e7b623 100644 --- a/python/docs/index.rst +++ b/python/docs/index.rst @@ -37,7 +37,7 @@ Core classes: A Discretized Stream (DStream), the basic abstraction in Spark Streaming. -:class:`pyspark.sql.SQLContext` +:class:`pyspark.sql.SparkSession` Main entry point for DataFrame and SQL functionality. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [MINOR][PYTHON] Fix SQLContext to SparkSession in Python API main page
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e92088d [MINOR][PYTHON] Fix SQLContext to SparkSession in Python API main page e92088d is described below commit e92088de4d6755f975eb8b44b4d75b81e5a0720e Author: Hyukjin Kwon AuthorDate: Wed Jan 16 23:23:36 2019 +0800 [MINOR][PYTHON] Fix SQLContext to SparkSession in Python API main page ## What changes were proposed in this pull request? This PR proposes to fix the deprecated `SQLContext` to `SparkSession` in the Python API main page. **Before:** ![screen shot 2019-01-16 at 5 30 19 pm](https://user-images.githubusercontent.com/6477701/51239583-bac82f80-19b4-11e9-9129-8dae2c23ec79.png) **After:** ![screen shot 2019-01-16 at 5 29 54 pm](https://user-images.githubusercontent.com/6477701/51239577-b734a880-19b4-11e9-8539-592cb772168d.png) ## How was this patch tested? Manually checked the doc after building it. I also checked by `grep -r "SQLContext"`, and it looks like this is the only instance left. Closes #23565 from HyukjinKwon/minor-doc-change. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- python/docs/index.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/docs/index.rst b/python/docs/index.rst index 421c8de..0e7b623 100644 --- a/python/docs/index.rst +++ b/python/docs/index.rst @@ -37,7 +37,7 @@ Core classes: A Discretized Stream (DStream), the basic abstraction in Spark Streaming. -:class:`pyspark.sql.SQLContext` +:class:`pyspark.sql.SparkSession` Main entry point for DataFrame and SQL functionality. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-26615][CORE] Fixing transport server/client resource leaks in the core unittests
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 22ab94f [SPARK-26615][CORE] Fixing transport server/client resource leaks in the core unittests 22ab94f is described below commit 22ab94f97cec22086db287d64a05efc3a177f4c2 Author: “attilapiros” AuthorDate: Wed Jan 16 09:00:21 2019 -0600 [SPARK-26615][CORE] Fixing transport server/client resource leaks in the core unittests ## What changes were proposed in this pull request? Fixing resource leaks where TransportClient/TransportServer instances are not closed properly. In StandaloneSchedulerBackend the null check is added because during the SparkContextSchedulerCreationSuite #"local-cluster" test it turned out that the client is not initialised, as org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend#start isn't called. It threw an NPE and some resources remained open. ## How was this patch tested? By executing the unit tests and using some extra temporary logging for counting created and closed TransportClient/TransportServer instances. Closes #23540 from attilapiros/leaks. 
Authored-by: “attilapiros” Signed-off-by: Sean Owen (cherry picked from commit 819e5ea7c290f842c51ead8b4a6593678aeef6bf) Signed-off-by: Sean Owen --- .../cluster/StandaloneSchedulerBackend.scala | 5 +- .../spark/SparkContextSchedulerCreationSuite.scala | 103 --- .../spark/deploy/client/AppClientSuite.scala | 75 ++- .../apache/spark/deploy/master/MasterSuite.scala | 111 + .../apache/spark/storage/BlockManagerSuite.scala | 138 + 5 files changed, 228 insertions(+), 204 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala index f73a58f..6df821f 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala @@ -223,8 +223,9 @@ private[spark] class StandaloneSchedulerBackend( if (stopping.compareAndSet(false, true)) { try { super.stop() -client.stop() - +if (client != null) { + client.stop() +} val callback = shutdownCallback if (callback != null) { callback(this) diff --git a/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala b/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala index f8938df..811b975 100644 --- a/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala +++ b/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala @@ -23,110 +23,129 @@ import org.apache.spark.internal.Logging import org.apache.spark.scheduler.{SchedulerBackend, TaskScheduler, TaskSchedulerImpl} import org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend import org.apache.spark.scheduler.local.LocalSchedulerBackend +import org.apache.spark.util.Utils class SparkContextSchedulerCreationSuite extends SparkFunSuite with LocalSparkContext with PrivateMethodTester with Logging { - def createTaskScheduler(master: 
String): TaskSchedulerImpl = -createTaskScheduler(master, "client") + def noOp(taskSchedulerImpl: TaskSchedulerImpl): Unit = {} - def createTaskScheduler(master: String, deployMode: String): TaskSchedulerImpl = -createTaskScheduler(master, deployMode, new SparkConf()) + def createTaskScheduler(master: String)(body: TaskSchedulerImpl => Unit = noOp): Unit = +createTaskScheduler(master, "client")(body) + + def createTaskScheduler(master: String, deployMode: String)( + body: TaskSchedulerImpl => Unit): Unit = +createTaskScheduler(master, deployMode, new SparkConf())(body) def createTaskScheduler( master: String, deployMode: String, - conf: SparkConf): TaskSchedulerImpl = { + conf: SparkConf)(body: TaskSchedulerImpl => Unit): Unit = { // Create local SparkContext to setup a SparkEnv. We don't actually want to start() the // real schedulers, so we don't want to create a full SparkContext with the desired scheduler. sc = new SparkContext("local", "test", conf) val createTaskSchedulerMethod = PrivateMethod[Tuple2[SchedulerBackend, TaskScheduler]]('createTaskScheduler) -val (_, sched) = SparkContext invokePrivate createTaskSchedulerMethod(sc, master, deployMode) -sched.asInstanceOf[TaskSchedulerImpl] +val (_, sched) = + SparkContext invokePrivate createTaskSchedulerMethod(sc, master, deployMode) +try { +
[spark] branch master updated: [SPARK-26615][CORE] Fixing transport server/client resource leaks in the core unittests
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 819e5ea [SPARK-26615][CORE] Fixing transport server/client resource leaks in the core unittests 819e5ea is described below commit 819e5ea7c290f842c51ead8b4a6593678aeef6bf Author: “attilapiros” AuthorDate: Wed Jan 16 09:00:21 2019 -0600 [SPARK-26615][CORE] Fixing transport server/client resource leaks in the core unittests ## What changes were proposed in this pull request? Fixing resource leaks where TransportClient/TransportServer instances are not closed properly. In StandaloneSchedulerBackend the null check is added because during the SparkContextSchedulerCreationSuite #"local-cluster" test it turned out that the client is not initialised, as org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend#start isn't called. It threw an NPE and some resources remained open. ## How was this patch tested? By executing the unit tests and using some extra temporary logging for counting created and closed TransportClient/TransportServer instances. Closes #23540 from attilapiros/leaks. 
Authored-by: “attilapiros” Signed-off-by: Sean Owen --- .../cluster/StandaloneSchedulerBackend.scala | 5 +- .../spark/SparkContextSchedulerCreationSuite.scala | 103 --- .../spark/deploy/client/AppClientSuite.scala | 75 ++- .../apache/spark/deploy/master/MasterSuite.scala | 111 + .../apache/spark/storage/BlockManagerSuite.scala | 138 + 5 files changed, 228 insertions(+), 204 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala index 66080b6..e0605fe 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala @@ -224,8 +224,9 @@ private[spark] class StandaloneSchedulerBackend( if (stopping.compareAndSet(false, true)) { try { super.stop() -client.stop() - +if (client != null) { + client.stop() +} val callback = shutdownCallback if (callback != null) { callback(this) diff --git a/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala b/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala index f8938df..811b975 100644 --- a/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala +++ b/core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala @@ -23,110 +23,129 @@ import org.apache.spark.internal.Logging import org.apache.spark.scheduler.{SchedulerBackend, TaskScheduler, TaskSchedulerImpl} import org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend import org.apache.spark.scheduler.local.LocalSchedulerBackend +import org.apache.spark.util.Utils class SparkContextSchedulerCreationSuite extends SparkFunSuite with LocalSparkContext with PrivateMethodTester with Logging { - def createTaskScheduler(master: String): TaskSchedulerImpl = -createTaskScheduler(master, "client") + def 
noOp(taskSchedulerImpl: TaskSchedulerImpl): Unit = {} - def createTaskScheduler(master: String, deployMode: String): TaskSchedulerImpl = -createTaskScheduler(master, deployMode, new SparkConf()) + def createTaskScheduler(master: String)(body: TaskSchedulerImpl => Unit = noOp): Unit = +createTaskScheduler(master, "client")(body) + + def createTaskScheduler(master: String, deployMode: String)( + body: TaskSchedulerImpl => Unit): Unit = +createTaskScheduler(master, deployMode, new SparkConf())(body) def createTaskScheduler( master: String, deployMode: String, - conf: SparkConf): TaskSchedulerImpl = { + conf: SparkConf)(body: TaskSchedulerImpl => Unit): Unit = { // Create local SparkContext to setup a SparkEnv. We don't actually want to start() the // real schedulers, so we don't want to create a full SparkContext with the desired scheduler. sc = new SparkContext("local", "test", conf) val createTaskSchedulerMethod = PrivateMethod[Tuple2[SchedulerBackend, TaskScheduler]]('createTaskScheduler) -val (_, sched) = SparkContext invokePrivate createTaskSchedulerMethod(sc, master, deployMode) -sched.asInstanceOf[TaskSchedulerImpl] +val (_, sched) = + SparkContext invokePrivate createTaskSchedulerMethod(sc, master, deployMode) +try { + body(sched.asInstanceOf[TaskSchedulerImpl]) +} finally { + Utils.tryLogNonFatalError { +sched.stop() + } +} }
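The refactoring in the diff above applies the loan pattern: instead of returning a live scheduler to the caller (who may forget to stop it), `createTaskScheduler` now takes a `body` callback and guarantees cleanup in a `finally` block. A minimal sketch of the same pattern in Python follows; `with_scheduler` and `FakeScheduler` are illustrative names, not Spark APIs:

```python
def with_scheduler(create, body):
    """Loan pattern: create the resource, lend it to `body`, always clean up."""
    sched = create()
    try:
        body(sched)
    finally:
        # Cleanup runs even when body raises, so a test cannot leak the
        # resource the way the old return-the-resource API could.
        sched.stop()

class FakeScheduler:
    """Stand-in for a scheduler that must be stopped to release resources."""
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True

s = FakeScheduler()
with_scheduler(lambda: s, lambda sched: None)
print(s.stopped)  # → True
```

The design choice mirrors the Scala change: the resource's lifetime is owned by the helper, and callers express only what they want to do with it while it is alive.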
[spark] branch branch-2.3 updated (18c138b -> 8319ba7)
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a change to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git. from 18c138b Revert "[SPARK-26576][SQL] Broadcast hint not applied to partitioned table" add b5ea933 Preparing Spark release v2.3.3-rc1 new 8319ba7 Preparing development version 2.3.4-SNAPSHOT The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: R/pkg/DESCRIPTION | 2 +- assembly/pom.xml | 2 +- common/kvstore/pom.xml| 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 4 ++-- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- hadoop-cloud/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/kubernetes/core/pom.xml | 2 +- resource-managers/mesos/pom.xml | 2 +- resource-managers/yarn/pom.xml| 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 41 files changed, 42 insertions(+), 42 deletions(-) - To unsubscribe, e-mail: 
commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] 01/01: Preparing development version 2.3.4-SNAPSHOT
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git commit 8319ba736c8909aa944e28d1c8de501926e9f50f Author: Takeshi Yamamuro AuthorDate: Wed Jan 16 13:22:01 2019 + Preparing development version 2.3.4-SNAPSHOT --- R/pkg/DESCRIPTION | 2 +- assembly/pom.xml | 2 +- common/kvstore/pom.xml| 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 4 ++-- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- hadoop-cloud/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/kubernetes/core/pom.xml | 2 +- resource-managers/mesos/pom.xml | 2 +- resource-managers/yarn/pom.xml| 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 41 files changed, 42 insertions(+), 42 deletions(-) diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION index 6ec4966..a82446e 100644 --- a/R/pkg/DESCRIPTION +++ b/R/pkg/DESCRIPTION @@ -1,6 +1,6 @@ Package: SparkR Type: Package -Version: 2.3.3 +Version: 2.3.4 Title: R Frontend for Apache Spark Description: Provides an R Frontend for Apache Spark. 
Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"), diff --git a/assembly/pom.xml b/assembly/pom.xml index 6a8cd4f..612a1b8 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +2.3.4-SNAPSHOT ../pom.xml diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index 6010b6e..5547e97 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +2.3.4-SNAPSHOT ../../pom.xml diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index 8b5d3c8..119dde2 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +2.3.4-SNAPSHOT ../../pom.xml diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index dd27a24..dba5224 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +2.3.4-SNAPSHOT ../../pom.xml diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index aded5e7d..56902a3 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +2.3.4-SNAPSHOT ../../pom.xml diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index a50f612..5302d95 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +2.3.4-SNAPSHOT ../../pom.xml diff --git a/common/tags/pom.xml b/common/tags/pom.xml index 8112ca4..232ebfa 100644 --- a/common/tags/pom.xml +++ b/common/tags/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +2.3.4-SNAPSHOT ../../pom.xml diff --git a/common/unsafe/pom.xml b/common/unsafe/pom.xml index 0d5f61f..f0baa2a 100644 --- a/common/unsafe/pom.xml +++ b/common/unsafe/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3 +
[spark] tag v2.3.3-rc1 created (now b5ea933)
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a change to tag v2.3.3-rc1 in repository https://gitbox.apache.org/repos/asf/spark.git. at b5ea933 (commit) This tag includes the following new commits: new b5ea933 Preparing Spark release v2.3.3-rc1 The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] 01/01: Preparing Spark release v2.3.3-rc1
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a commit to tag v2.3.3-rc1 in repository https://gitbox.apache.org/repos/asf/spark.git commit b5ea9330e3072e99841270b10dc1d2248127064b Author: Takeshi Yamamuro AuthorDate: Wed Jan 16 13:21:25 2019 + Preparing Spark release v2.3.3-rc1 --- assembly/pom.xml | 2 +- common/kvstore/pom.xml| 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 2 +- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- hadoop-cloud/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/kubernetes/core/pom.xml | 2 +- resource-managers/mesos/pom.xml | 2 +- resource-managers/yarn/pom.xml| 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 40 files changed, 40 insertions(+), 40 deletions(-) diff --git a/assembly/pom.xml b/assembly/pom.xml index f8b15cc..6a8cd4f 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../pom.xml diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index e412a47..6010b6e 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ 
-22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../../pom.xml diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index d8f9a3d..8b5d3c8 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../../pom.xml diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index a1a4f87..dd27a24 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../../pom.xml diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index e650978..aded5e7d 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../../pom.xml diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index 350e3cb..a50f612 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../../pom.xml diff --git a/common/tags/pom.xml b/common/tags/pom.xml index e7fea41..8112ca4 100644 --- a/common/tags/pom.xml +++ b/common/tags/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../../pom.xml diff --git a/common/unsafe/pom.xml b/common/unsafe/pom.xml index 601cc5d..0d5f61f 100644 --- a/common/unsafe/pom.xml +++ b/common/unsafe/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../../pom.xml diff --git a/core/pom.xml b/core/pom.xml index 2a7e644..930128d 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.3 ../pom.xml diff --git a/docs/_config.yml b/docs/_config.yml index 7629f5f..8e9c3b5 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -14,7 +14,7 @@ include: # These allow
[spark] tag v2.3.3-rc1 deleted (was 2e01a70)
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a change to tag v2.3.3-rc1 in repository https://gitbox.apache.org/repos/asf/spark.git. *** WARNING: tag v2.3.3-rc1 was deleted! *** was 2e01a70 Preparing Spark release v2.3.3-rc1 The revisions that were on this tag are still contained in other references; therefore, this change does not discard any commits from the repository. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.3 updated (1979712 -> 18c138b)
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a change to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git. omit 1979712 [SPARK-26120][TESTS][SS][SPARKR] Fix a streaming query leak in Structured Streaming R tests omit 3137dca Preparing development version 2.3.4-SNAPSHOT omit 2e01a70 Preparing Spark release v2.3.3-rc1 new 2a82295 [SPARK-26120][TESTS][SS][SPARKR] Fix a streaming query leak in Structured Streaming R tests new 18c138b Revert "[SPARK-26576][SQL] Broadcast hint not applied to partitioned table" This update added new revisions after undoing existing revisions. That is to say, some revisions that were in the old version of the branch are not in the new version. This situation occurs when a user --force pushes a change and generates a repository containing something like this: * -- * -- B -- O -- O -- O (1979712) \ N -- N -- N refs/heads/branch-2.3 (18c138b) You should already have received notification emails for all of the O revisions, and so the following emails describe only the N revisions from the common base, B. Any revisions marked "omit" are not gone; other references still refer to them. Any revisions marked "discard" are gone forever. The 2 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. 
Summary of changes: R/pkg/DESCRIPTION | 2 +- assembly/pom.xml | 2 +- common/kvstore/pom.xml| 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 4 ++-- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- hadoop-cloud/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/kubernetes/core/pom.xml | 2 +- resource-managers/mesos/pom.xml | 2 +- resource-managers/yarn/pom.xml| 2 +- sql/catalyst/pom.xml | 2 +- .../apache/spark/sql/catalyst/planning/patterns.scala | 3 +++ sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- .../execution/PruneFileSourcePartitionsSuite.scala| 19 +-- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 43 files changed, 46 insertions(+), 60 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] 01/02: [SPARK-26120][TESTS][SS][SPARKR] Fix a streaming query leak in Structured Streaming R tests
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git commit 2a82295bd4b904f50d42a7585a72f91fff75353d Author: Shixiong Zhu AuthorDate: Wed Nov 21 09:31:12 2018 +0800 [SPARK-26120][TESTS][SS][SPARKR] Fix a streaming query leak in Structured Streaming R tests ## What changes were proposed in this pull request? Stop the streaming query in `Specify a schema by using a DDL-formatted string when reading` to avoid outputting annoying logs. ## How was this patch tested? Jenkins Closes #23089 from zsxwing/SPARK-26120. Authored-by: Shixiong Zhu Signed-off-by: hyukjinkwon (cherry picked from commit 4b7f7ef5007c2c8a5090f22c6e08927e9f9a407b) Signed-off-by: Felix Cheung --- R/pkg/tests/fulltests/test_streaming.R | 1 + 1 file changed, 1 insertion(+) diff --git a/R/pkg/tests/fulltests/test_streaming.R b/R/pkg/tests/fulltests/test_streaming.R index bfb1a04..6f0d2ae 100644 --- a/R/pkg/tests/fulltests/test_streaming.R +++ b/R/pkg/tests/fulltests/test_streaming.R @@ -127,6 +127,7 @@ test_that("Specify a schema by using a DDL-formatted string when reading", { expect_false(awaitTermination(q, 5 * 1000)) callJMethod(q@ssq, "processAllAvailable") expect_equal(head(sql("SELECT count(*) FROM people3"))[[1]], 3) + stopQuery(q) expect_error(read.stream(path = parquetPath, schema = "name stri"), "DataType stri is not supported.") - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] 02/02: Revert "[SPARK-26576][SQL] Broadcast hint not applied to partitioned table"
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git commit 18c138bf01cd43edc6db7115b90c4e7ae7126392 Author: Takeshi Yamamuro AuthorDate: Wed Jan 16 21:56:39 2019 +0900 Revert "[SPARK-26576][SQL] Broadcast hint not applied to partitioned table" This reverts commit 87c2c11e742a8b35699f68ec2002f817c56bef87. --- .../apache/spark/sql/catalyst/planning/patterns.scala | 3 +++ .../execution/PruneFileSourcePartitionsSuite.scala| 19 +-- 2 files changed, 4 insertions(+), 18 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala index a91063b..cc391aa 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala @@ -65,6 +65,9 @@ object PhysicalOperation extends PredicateHelper { val substitutedCondition = substitute(aliases)(condition) (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases) + case h: ResolvedHint => +collectProjectsAndFilters(h.child) + case other => (None, Nil, other, Map.empty) } diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneFileSourcePartitionsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneFileSourcePartitionsSuite.scala index 8a9adf7..9438418 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneFileSourcePartitionsSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneFileSourcePartitionsSuite.scala @@ -17,20 +17,15 @@ package org.apache.spark.sql.hive.execution -import org.scalatest.Matchers._ - import org.apache.spark.sql.QueryTest import org.apache.spark.sql.catalyst.TableIdentifier import 
org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.dsl.plans._ -import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project, ResolvedHint} +import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project} import org.apache.spark.sql.catalyst.rules.RuleExecutor import org.apache.spark.sql.execution.datasources.{CatalogFileIndex, HadoopFsRelation, LogicalRelation, PruneFileSourcePartitions} import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat -import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec -import org.apache.spark.sql.functions.broadcast import org.apache.spark.sql.hive.test.TestHiveSingleton -import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.test.SQLTestUtils import org.apache.spark.sql.types.StructType @@ -96,16 +91,4 @@ class PruneFileSourcePartitionsSuite extends QueryTest with SQLTestUtils with Te assert(size2 < tableStats.get.sizeInBytes) } } - - test("SPARK-26576 Broadcast hint not applied to partitioned table") { -withTable("tbl") { - withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { -spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").saveAsTable("tbl") -val df = spark.table("tbl") -val qe = df.join(broadcast(df), "p").queryExecution -qe.optimizedPlan.collect { case _: ResolvedHint => } should have size 1 -qe.sparkPlan.collect { case j: BroadcastHashJoinExec => j } should have size 1 - } -} - } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] packtpartner commented on issue #172: Update documentation.md
packtpartner commented on issue #172: Update documentation.md
URL: https://github.com/apache/spark-website/pull/172#issuecomment-454746306

   We have added a new book listing like the previous ones.
[spark] Diff for: [GitHub] wangshuo128 closed pull request #23355: [SPARK-26418][SHUFFLE] Only OpenBlocks without any ChunkFetch for one stream will cause memory leak in ExternalShuffleService
diff --git a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
index 20d840baeaf6c..8d46671f9581a 100644
--- a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
+++ b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
@@ -124,6 +124,9 @@ public void setClientId(String id) {
    * to be returned in the same order that they were requested, assuming only a single
    * TransportClient is used to fetch the chunks.
    *
+   * OpenBlocks and following FetchChunk requests for a stream should be sent by the same
+   * TransportClient to avoid potential memory leak on server side.
+   *
    * @param streamId Identifier that refers to a stream in the remote StreamManager. This should
    *                 be agreed upon by client and server beforehand.
    * @param chunkIndex 0-based index of the chunk to fetch
diff --git a/common/network-common/src/main/java/org/apache/spark/network/server/ChunkFetchRequestHandler.java b/common/network-common/src/main/java/org/apache/spark/network/server/ChunkFetchRequestHandler.java
index f08d8b0f984cf..43c3d23b6304d 100644
--- a/common/network-common/src/main/java/org/apache/spark/network/server/ChunkFetchRequestHandler.java
+++ b/common/network-common/src/main/java/org/apache/spark/network/server/ChunkFetchRequestHandler.java
@@ -90,7 +90,6 @@ protected void channelRead0(
     ManagedBuffer buf;
     try {
       streamManager.checkAuthorization(client, msg.streamChunkId.streamId);
-      streamManager.registerChannel(channel, msg.streamChunkId.streamId);
       buf = streamManager.getChunk(msg.streamChunkId.streamId, msg.streamChunkId.chunkIndex);
     } catch (Exception e) {
       logger.error(String.format("Error opening block %s for request from %s",
diff --git a/common/network-common/src/main/java/org/apache/spark/network/server/OneForOneStreamManager.java b/common/network-common/src/main/java/org/apache/spark/network/server/OneForOneStreamManager.java
index 0f6a8824d95e5..3e1f77cdc20ed 100644
--- a/common/network-common/src/main/java/org/apache/spark/network/server/OneForOneStreamManager.java
+++ b/common/network-common/src/main/java/org/apache/spark/network/server/OneForOneStreamManager.java
@@ -23,6 +23,7 @@ import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.atomic.AtomicLong;
 
+import com.google.common.annotations.VisibleForTesting;
 import com.google.common.base.Preconditions;
 import io.netty.channel.Channel;
 import org.apache.commons.lang3.tuple.ImmutablePair;
@@ -202,4 +203,8 @@ public long registerStream(String appId, Iterator buffers) {
     return myStreamId;
   }
 
+  @VisibleForTesting
+  public long getStreamCount() {
+    return streams.size();
+  }
 }
diff --git a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
index 098fa7974b87b..c59433c44c5c0 100644
--- a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
+++ b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
@@ -92,6 +92,7 @@ protected void handleMessage(
       checkAuth(client, msg.appId);
       long streamId = streamManager.registerStream(client.getClientId(),
         new ManagedBufferIterator(msg.appId, msg.execId, msg.blockIds));
+      streamManager.registerChannel(client.getChannel(), streamId);
       if (logger.isTraceEnabled()) {
         logger.trace("Registered streamId {} with {} buffers for client {} from host {}",
           streamId,
diff --git a/common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/StreamStatesCleanupSuite.java b/common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/StreamStatesCleanupSuite.java
new file mode 100644
index 0..4a721260e3db3
--- /dev/null
+++ b/common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/StreamStatesCleanupSuite.java
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the
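For context on the leak this PR addresses: a stream's buffers in `OneForOneStreamManager` are only released when the channel the stream was registered against closes. Before the change, `registerChannel` was called from `ChunkFetchRequestHandler`, so an `OpenBlocks` that was never followed by a `ChunkFetch` left its stream state registered forever; the patch registers the channel immediately after `registerStream` in `ExternalShuffleBlockHandler`. The sketch below is a minimal, self-contained model of that lifecycle with hypothetical names (plain maps and an `int` channel id), not the real Spark/Netty API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the leak fixed by SPARK-26418 -- hypothetical names,
// not the real OneForOneStreamManager. Stream state is only freed when the
// channel it was tied to terminates; a stream never tied to a channel
// (pre-fix behavior when no ChunkFetch follows OpenBlocks) leaks.
public class StreamCleanupSketch {
  static class StreamManager {
    final Map<Long, String> streams = new HashMap<>();        // streamId -> buffers (stubbed)
    final Map<Long, Integer> streamChannel = new HashMap<>(); // streamId -> channel id
    final AtomicLong nextId = new AtomicLong();

    // Called when an OpenBlocks message arrives.
    long registerStream(String buffers) {
      long id = nextId.getAndIncrement();
      streams.put(id, buffers);
      return id;
    }

    // Ties a stream to a channel so channel close can clean it up.
    void registerChannel(int channelId, long streamId) {
      streamChannel.put(streamId, channelId);
    }

    // Channel closed: release every stream registered against it.
    void connectionTerminated(int channelId) {
      streamChannel.entrySet().removeIf(e -> {
        if (e.getValue() == channelId) {
          streams.remove(e.getKey());
          return true;
        }
        return false;
      });
    }
  }

  public static void main(String[] args) {
    StreamManager m = new StreamManager();

    // Pre-fix: stream registered on OpenBlocks, channel only on first ChunkFetch.
    long leaked = m.registerStream("blocks-A");        // client never fetches a chunk
    m.connectionTerminated(1);                         // nothing tied to channel 1
    System.out.println(m.streams.containsKey(leaked)); // true: buffers still held

    // Post-fix: registerChannel is called right after registerStream.
    long ok = m.registerStream("blocks-B");
    m.registerChannel(1, ok);
    m.connectionTerminated(1);
    System.out.println(m.streams.containsKey(ok));     // false: cleaned up
  }
}
```

This also explains the Javadoc addition to `TransportClient`: since cleanup is keyed by the channel that issued `OpenBlocks`, fetching the stream's chunks through a different `TransportClient` would leave server-side state attached to the wrong channel.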