spark git commit: [SPARK-23529][K8S] Support mounting volumes

2018-07-10 Thread felixcheung
Repository: spark
Updated Branches:
  refs/heads/master 74a8d6308 -> 5ff1b9ba1


[SPARK-23529][K8S] Support mounting volumes

This PR continues #21095 and intersects with #21238. It adds volume mounts 
as a separate step and adds PersistentVolumeClaim support.

There is a fundamental problem with how we pass the options through Spark conf 
to fabric8. For each volume type and every possible volume option we would have 
to implement custom code to map config values to fabric8 calls. This results in 
a large body of code we would have to maintain and means that Spark will always 
be somewhat out of sync with Kubernetes.

I think there needs to be a discussion on how to proceed correctly (e.g. use 
PodPreset instead).



Due to the complications of provisioning and managing actual resources, this PR 
addresses only the mounting of volumes that already exist.
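
As a rough sketch of how the new properties (documented in docs/running-on-kubernetes.md 
below) fit together, the volume could be configured programmatically. The claim name, 
mount path, and the `options.claimName` key used here are illustrative assumptions, not 
taken verbatim from this commit.

```
// Sketch only: mount an existing PersistentVolumeClaim into the driver pod.
// "checkpointpvc" and "/checkpoint" are placeholders; `options.claimName` is assumed
// to be the option key for the persistentVolumeClaim volume type.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kubernetes.driver.volumes.persistentVolumeClaim.checkpointpvc.mount.path", "/checkpoint")
  .set("spark.kubernetes.driver.volumes.persistentVolumeClaim.checkpointpvc.mount.readOnly", "false")
  .set("spark.kubernetes.driver.volumes.persistentVolumeClaim.checkpointpvc.options.claimName", "checkpointpvc")
```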


- [x] emptyDir support
- [x] Testing
- [x] Documentation
- [x] KubernetesVolumeUtils tests

Author: Andrew Korzhuev 
Author: madanadit 

Closes #21260 from andrusha/k8s-vol.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5ff1b9ba
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5ff1b9ba
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5ff1b9ba

Branch: refs/heads/master
Commit: 5ff1b9ba1983d5601add62aef64a3e87d07050eb
Parents: 74a8d63
Author: Andrew Korzhuev 
Authored: Tue Jul 10 22:53:44 2018 -0700
Committer: Felix Cheung 
Committed: Tue Jul 10 22:53:44 2018 -0700

--
 docs/running-on-kubernetes.md   |  48 +++
 .../org/apache/spark/deploy/k8s/Config.scala|  12 ++
 .../spark/deploy/k8s/KubernetesConf.scala   |  11 ++
 .../spark/deploy/k8s/KubernetesUtils.scala  |   2 -
 .../spark/deploy/k8s/KubernetesVolumeSpec.scala |  38 +
 .../deploy/k8s/KubernetesVolumeUtils.scala  | 110 ++
 .../k8s/features/BasicDriverFeatureStep.scala   |   5 +-
 .../k8s/features/BasicExecutorFeatureStep.scala |   5 +-
 .../k8s/features/MountVolumesFeatureStep.scala  |  79 ++
 .../k8s/submit/KubernetesDriverBuilder.scala|  31 ++--
 .../cluster/k8s/KubernetesExecutorBuilder.scala |  38 ++---
 .../deploy/k8s/KubernetesVolumeUtilsSuite.scala | 106 ++
 .../features/BasicDriverFeatureStepSuite.scala  |  23 +--
 .../BasicExecutorFeatureStepSuite.scala |   3 +
 ...rKubernetesCredentialsFeatureStepSuite.scala |   3 +
 .../DriverServiceFeatureStepSuite.scala |   6 +
 .../features/EnvSecretsFeatureStepSuite.scala   |   1 +
 .../features/LocalDirsFeatureStepSuite.scala|   3 +-
 .../features/MountSecretsFeatureStepSuite.scala |   1 +
 .../features/MountVolumesFeatureStepSuite.scala | 144 +++
 .../bindings/JavaDriverFeatureStepSuite.scala   |   1 +
 .../bindings/PythonDriverFeatureStepSuite.scala |   2 +
 .../spark/deploy/k8s/submit/ClientSuite.scala   |   1 +
 .../submit/KubernetesDriverBuilderSuite.scala   |  45 +-
 .../k8s/KubernetesExecutorBuilderSuite.scala|  38 -
 25 files changed, 705 insertions(+), 51 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5ff1b9ba/docs/running-on-kubernetes.md
--
diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md
index 408e446..7149616 100644
--- a/docs/running-on-kubernetes.md
+++ b/docs/running-on-kubernetes.md
@@ -629,6 +629,54 @@ specific to Spark on Kubernetes.
    Add as an environment variable to the executor container with name EnvName
    (case sensitive), the value referenced by key key in the data of the referenced
    Kubernetes Secret
    (https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-environment-variables).
    For example, spark.kubernetes.executor.secrets.ENV_VAR=spark-secret:key.
+
+  spark.kubernetes.driver.volumes.[VolumeType].[VolumeName].mount.path
+  (none)
+  Add the Kubernetes Volume (https://kubernetes.io/docs/concepts/storage/volumes/)
+  named VolumeName of the VolumeType type to the driver pod on the path specified
+  in the value. For example,
+  spark.kubernetes.driver.volumes.persistentVolumeClaim.checkpointpvc.mount.path=/checkpoint.
+
+  spark.kubernetes.driver.volumes.[VolumeType].[VolumeName].mount.readOnly
+  (none)
+  Specify if the mounted volume is read only or not. For example,
+  spark.kubernetes.driver.volumes.persistentVolumeClaim.checkpointpvc.mount.readOnly=false.
+
+  spark.kubernetes.driver.volumes.[VolumeType].[VolumeName].options.[OptionName]
+  (none)
+  Configure Kubernetes Volume (https://kubernetes.io/docs/concepts/storage/volumes/)
+  options passed to Kubernetes with OptionName as key having the specified value;
+  must conform with the Kubernetes option format. For example,

[spark] Git Push Summary

2018-07-10 Thread jshao
Repository: spark
Updated Tags:  refs/tags/v2.3.2-rc2 [created] 307499e1a




[2/2] spark git commit: Preparing development version 2.3.3-SNAPSHOT

2018-07-10 Thread jshao
Preparing development version 2.3.3-SNAPSHOT


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/86457a16
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/86457a16
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/86457a16

Branch: refs/heads/branch-2.3
Commit: 86457a16de20eb126d0569d12f51b2e427fd03c3
Parents: 307499e
Author: Saisai Shao 
Authored: Wed Jul 11 05:27:12 2018 +
Committer: Saisai Shao 
Committed: Wed Jul 11 05:27:12 2018 +

--
 R/pkg/DESCRIPTION | 2 +-
 assembly/pom.xml  | 2 +-
 common/kvstore/pom.xml| 2 +-
 common/network-common/pom.xml | 2 +-
 common/network-shuffle/pom.xml| 2 +-
 common/network-yarn/pom.xml   | 2 +-
 common/sketch/pom.xml | 2 +-
 common/tags/pom.xml   | 2 +-
 common/unsafe/pom.xml | 2 +-
 core/pom.xml  | 2 +-
 docs/_config.yml  | 4 ++--
 examples/pom.xml  | 2 +-
 external/docker-integration-tests/pom.xml | 2 +-
 external/flume-assembly/pom.xml   | 2 +-
 external/flume-sink/pom.xml   | 2 +-
 external/flume/pom.xml| 2 +-
 external/kafka-0-10-assembly/pom.xml  | 2 +-
 external/kafka-0-10-sql/pom.xml   | 2 +-
 external/kafka-0-10/pom.xml   | 2 +-
 external/kafka-0-8-assembly/pom.xml   | 2 +-
 external/kafka-0-8/pom.xml| 2 +-
 external/kinesis-asl-assembly/pom.xml | 2 +-
 external/kinesis-asl/pom.xml  | 2 +-
 external/spark-ganglia-lgpl/pom.xml   | 2 +-
 graphx/pom.xml| 2 +-
 hadoop-cloud/pom.xml  | 2 +-
 launcher/pom.xml  | 2 +-
 mllib-local/pom.xml   | 2 +-
 mllib/pom.xml | 2 +-
 pom.xml   | 2 +-
 python/pyspark/version.py | 2 +-
 repl/pom.xml  | 2 +-
 resource-managers/kubernetes/core/pom.xml | 2 +-
 resource-managers/mesos/pom.xml   | 2 +-
 resource-managers/yarn/pom.xml| 2 +-
 sql/catalyst/pom.xml  | 2 +-
 sql/core/pom.xml  | 2 +-
 sql/hive-thriftserver/pom.xml | 2 +-
 sql/hive/pom.xml  | 2 +-
 streaming/pom.xml | 2 +-
 tools/pom.xml | 2 +-
 41 files changed, 42 insertions(+), 42 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/86457a16/R/pkg/DESCRIPTION
--
diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index 8df2635..6ec4966 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: SparkR
 Type: Package
-Version: 2.3.2
+Version: 2.3.3
 Title: R Frontend for Apache Spark
 Description: Provides an R Frontend for Apache Spark.
 Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),

http://git-wip-us.apache.org/repos/asf/spark/blob/86457a16/assembly/pom.xml
--
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 57485fc..f8b15cc 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.11
-2.3.2
+2.3.3-SNAPSHOT
 ../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/86457a16/common/kvstore/pom.xml
--
diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml
index 53e58c2..e412a47 100644
--- a/common/kvstore/pom.xml
+++ b/common/kvstore/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.11
-2.3.2
+2.3.3-SNAPSHOT
 ../../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/86457a16/common/network-common/pom.xml
--
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index d05647c..d8f9a3d 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.11
-2.3.2
+2.3.3-SNAPSHOT
 ../../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/86457a16/common/network-shuffle/pom.xml
--
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index 8d46761..a1a4f87 100644
--- a/common/network-shuffle/pom.xml
+++ b/common/network-shuffle/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.11
-

[1/2] spark git commit: Preparing Spark release v2.3.2-rc2

2018-07-10 Thread jshao
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 19542f5de -> 86457a16d


Preparing Spark release v2.3.2-rc2


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/307499e1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/307499e1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/307499e1

Branch: refs/heads/branch-2.3
Commit: 307499e1a99c6ad3ce0b978626894ea2c1e3807e
Parents: 19542f5
Author: Saisai Shao 
Authored: Wed Jul 11 05:27:02 2018 +
Committer: Saisai Shao 
Committed: Wed Jul 11 05:27:02 2018 +

--
 R/pkg/DESCRIPTION | 2 +-
 assembly/pom.xml  | 2 +-
 common/kvstore/pom.xml| 2 +-
 common/network-common/pom.xml | 2 +-
 common/network-shuffle/pom.xml| 2 +-
 common/network-yarn/pom.xml   | 2 +-
 common/sketch/pom.xml | 2 +-
 common/tags/pom.xml   | 2 +-
 common/unsafe/pom.xml | 2 +-
 core/pom.xml  | 2 +-
 docs/_config.yml  | 4 ++--
 examples/pom.xml  | 2 +-
 external/docker-integration-tests/pom.xml | 2 +-
 external/flume-assembly/pom.xml   | 2 +-
 external/flume-sink/pom.xml   | 2 +-
 external/flume/pom.xml| 2 +-
 external/kafka-0-10-assembly/pom.xml  | 2 +-
 external/kafka-0-10-sql/pom.xml   | 2 +-
 external/kafka-0-10/pom.xml   | 2 +-
 external/kafka-0-8-assembly/pom.xml   | 2 +-
 external/kafka-0-8/pom.xml| 2 +-
 external/kinesis-asl-assembly/pom.xml | 2 +-
 external/kinesis-asl/pom.xml  | 2 +-
 external/spark-ganglia-lgpl/pom.xml   | 2 +-
 graphx/pom.xml| 2 +-
 hadoop-cloud/pom.xml  | 2 +-
 launcher/pom.xml  | 2 +-
 mllib-local/pom.xml   | 2 +-
 mllib/pom.xml | 2 +-
 pom.xml   | 2 +-
 python/pyspark/version.py | 2 +-
 repl/pom.xml  | 2 +-
 resource-managers/kubernetes/core/pom.xml | 2 +-
 resource-managers/mesos/pom.xml   | 2 +-
 resource-managers/yarn/pom.xml| 2 +-
 sql/catalyst/pom.xml  | 2 +-
 sql/core/pom.xml  | 2 +-
 sql/hive-thriftserver/pom.xml | 2 +-
 sql/hive/pom.xml  | 2 +-
 streaming/pom.xml | 2 +-
 tools/pom.xml | 2 +-
 41 files changed, 42 insertions(+), 42 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/307499e1/R/pkg/DESCRIPTION
--
diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index 6ec4966..8df2635 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: SparkR
 Type: Package
-Version: 2.3.3
+Version: 2.3.2
 Title: R Frontend for Apache Spark
 Description: Provides an R Frontend for Apache Spark.
 Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),

http://git-wip-us.apache.org/repos/asf/spark/blob/307499e1/assembly/pom.xml
--
diff --git a/assembly/pom.xml b/assembly/pom.xml
index f8b15cc..57485fc 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.11
-2.3.3-SNAPSHOT
+2.3.2
 ../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/307499e1/common/kvstore/pom.xml
--
diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml
index e412a47..53e58c2 100644
--- a/common/kvstore/pom.xml
+++ b/common/kvstore/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.11
-2.3.3-SNAPSHOT
+2.3.2
 ../../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/307499e1/common/network-common/pom.xml
--
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index d8f9a3d..d05647c 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.11
-2.3.3-SNAPSHOT
+2.3.2
 ../../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/307499e1/common/network-shuffle/pom.xml
--
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index a1a4f87..8d46761 100644
--- a/common/network-shuffle/pom.xml
+++ b/common/network-shuffle/pom.xml

svn commit: r28049 - in /dev/spark/2.3.3-SNAPSHOT-2018_07_10_22_01-19542f5-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-07-10 Thread pwendell
Author: pwendell
Date: Wed Jul 11 05:16:16 2018
New Revision: 28049

Log:
Apache Spark 2.3.3-SNAPSHOT-2018_07_10_22_01-19542f5 docs


[This commit notification would consist of 1443 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




spark git commit: [SPARK-24165][SQL] Fixing conditional expressions to handle nullability of nested types

2018-07-10 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 1f94bf492 -> 74a8d6308


[SPARK-24165][SQL] Fixing conditional expressions to handle nullability of 
nested types

## What changes were proposed in this pull request?
This PR proposes a fix for the output data type of ```If``` and ```CaseWhen``` 
expressions. Until now, the implementation of these expressions ignored the 
nullability of nested types coming from different execution branches and simply 
returned the type of the first branch.

This could lead to an unwanted ```NullPointerException``` in other expressions 
that depend on an ```If```/```CaseWhen``` expression.

Example:
```
val rows = new util.ArrayList[Row]()
rows.add(Row(true, ("a", 1)))
rows.add(Row(false, (null, 2)))
val schema = StructType(Seq(
  StructField("cond", BooleanType, false),
  StructField("s", StructType(Seq(
StructField("val1", StringType, true),
StructField("val2", IntegerType, false)
  )), false)
))

val df = spark.createDataFrame(rows, schema)

df
  .select(when('cond, struct(lit("x").as("val1"), lit(10).as("val2"))).otherwise('s) as "res")
  .select('res.getField("val1"))
  .show()
```
Exception:
```
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
at 
org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
...
```
Output schema:
```
root
 |-- res.val1: string (nullable = false)
```
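
The gist of the fix, as a simplified sketch (not the PR's actual helper): when the 
branch result types differ only in nullability flags, the merged type must take the 
more permissive flag for each field.

```
// Simplified sketch of the idea behind findCommonTypeDifferentOnlyInNullFlags for
// struct types: a field is nullable in the result if it is nullable in either branch.
import org.apache.spark.sql.types._

def mergeNullability(t1: StructType, t2: StructType): StructType =
  StructType(t1.fields.zip(t2.fields).map { case (f1, f2) =>
    f1.copy(nullable = f1.nullable || f2.nullable)
  })

val branch1 = new StructType().add("val1", StringType, nullable = false)
val branch2 = new StructType().add("val1", StringType, nullable = true)
mergeNullability(branch1, branch2)  // val1 is nullable = true in the merged type
```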

## How was this patch tested?
New test cases added into
- DataFrameSuite.scala
- conditionalExpressions.scala

Author: Marek Novotny 

Closes #21687 from mn-mikke/SPARK-24165.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/74a8d630
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/74a8d630
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/74a8d630

Branch: refs/heads/master
Commit: 74a8d6308bfa6e7ed4c64e1175c77eb3114baed5
Parents: 1f94bf4
Author: Marek Novotny 
Authored: Wed Jul 11 12:21:03 2018 +0800
Committer: Wenchen Fan 
Committed: Wed Jul 11 12:21:03 2018 +0800

--
 .../sql/catalyst/analysis/TypeCoercion.scala| 22 --
 .../sql/catalyst/expressions/Expression.scala   | 37 ++-
 .../expressions/conditionalExpressions.scala| 28 
 .../ConditionalExpressionSuite.scala| 70 
 .../org/apache/spark/sql/DataFrameSuite.scala   | 58 
 5 files changed, 195 insertions(+), 20 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/74a8d630/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
index 72908c1..e8331c9 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
@@ -173,6 +173,18 @@ object TypeCoercion {
   }
 
   /**
+   * The method finds a common type for data types that differ only in 
nullable, containsNull
+   * and valueContainsNull flags. If the input types are too different, None 
is returned.
+   */
+  def findCommonTypeDifferentOnlyInNullFlags(t1: DataType, t2: DataType): 
Option[DataType] = {
+if (t1 == t2) {
+  Some(t1)
+} else {
+  findTypeForComplex(t1, t2, findCommonTypeDifferentOnlyInNullFlags)
+}
+  }
+
+  /**
* Case 2 type widening (see the classdoc comment above for TypeCoercion).
*
* i.e. the main difference with [[findTightestCommonType]] is that here we 
allow some
@@ -660,8 +672,8 @@ object TypeCoercion {
   object CaseWhenCoercion extends TypeCoercionRule {
 override protected def coerceTypes(
 plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
-  case c: CaseWhen if c.childrenResolved && !c.valueTypesEqual =>
-val maybeCommonType = findWiderCommonType(c.valueTypes)
+  case c: CaseWhen if c.childrenResolved && 
!c.areInputTypesForMergingEqual =>
+val maybeCommonType = findWiderCommonType(c.inputTypesForMerging)
 maybeCommonType.map { commonType =>
   var changed = false
   val newBranches = c.branches.map { case (condition, value) =>
@@ -693,10 +705,10 @@ object TypeCoercion {
 plan: LogicalPlan): LogicalPlan = plan 

svn commit: r28048 - in /dev/spark/2.4.0-SNAPSHOT-2018_07_10_20_01-1f94bf4-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-07-10 Thread pwendell
Author: pwendell
Date: Wed Jul 11 03:15:48 2018
New Revision: 28048

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_07_10_20_01-1f94bf4 docs


[This commit notification would consist of 1467 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




spark git commit: [SPARK-24530][PYTHON] Add a control to force Python version in Sphinx via environment variable, SPHINXPYTHON

2018-07-10 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 72eb97ce9 -> 19542f5de


[SPARK-24530][PYTHON] Add a control to force Python version in Sphinx via 
environment variable, SPHINXPYTHON

## What changes were proposed in this pull request?

This PR proposes to add a `SPHINXPYTHON` environment variable to control the 
Python version used by Sphinx.

The motivation for this environment variable is that some signatures in the Python 
documentation are not rendered properly when Sphinx runs under Python 2; see the 
JIRA for an example. Using Python 3 should be encouraged, but it looks like we will 
live with this problem for a long while in any event.

For the default case of `make html`, it keeps the previous behaviour and uses 
`SPHINXBUILD` as before. If `SPHINXPYTHON` is set, it forces Sphinx to use that 
specific Python version.

```
$ SPHINXPYTHON=python3 make html
python3 -msphinx -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

1. If `SPHINXPYTHON` is set, use that Python executable. If `SPHINXBUILD` is set, 
use sphinx-build.
2. If both are set, `SPHINXBUILD` takes priority over `SPHINXPYTHON`.
3. By default, `SPHINXBUILD` is used as 'sphinx-build'.

We could probably work around this by explicitly setting `SPHINXBUILD`, but the 
`sphinx-build` executable can't easily be distinguished by Python version: at least 
in my environment, and as far as I know, installing a newer Sphinx under a different 
Python version does not replace `sphinx-build`, and the command gives no indication 
or warning about which Python it uses.

## How was this patch tested?

Manually tested:

**`python` (Python 2.7) in the path with Sphinx:**

```
$ make html
sphinx-build -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

**`python` (Python 2.7) in the path without Sphinx:**

```
$ make html
Makefile:8: *** The 'sphinx-build' command was not found. Make sure you have 
Sphinx installed, then set the SPHINXBUILD environment variable to point to the 
full path of the 'sphinx-build' executable. Alternatively you can add the 
directory with the executable to your PATH. If you don't have Sphinx installed, 
grab it from http://sphinx-doc.org/.  Stop.
```

**`SPHINXPYTHON` set `python` (Python 2.7)  with Sphinx:**

```
$ SPHINXPYTHON=python make html
Makefile:35: *** Note that Python 3 is required to generate PySpark 
documentation correctly for now. Current Python executable was less than Python 
3. See SPARK-24530. To force Sphinx to use a specific Python executable, please 
set SPHINXPYTHON to point to the Python 3 executable..  Stop.
```

**`SPHINXPYTHON` set `python` (Python 2.7)  without Sphinx:**

```
$ SPHINXPYTHON=python make html
Makefile:35: *** Note that Python 3 is required to generate PySpark 
documentation correctly for now. Current Python executable was less than Python 
3. See SPARK-24530. To force Sphinx to use a specific Python executable, please 
set SPHINXPYTHON to point to the Python 3 executable..  Stop.
```

**`SPHINXPYTHON` set `python3` with Sphinx:**

```
$ SPHINXPYTHON=python3 make html
python3 -msphinx -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

**`SPHINXPYTHON` set `python3` without Sphinx:**

```
$ SPHINXPYTHON=python3 make html
Makefile:39: *** Python executable 'python3' did not have Sphinx installed. 
Make sure you have Sphinx installed, then set the SPHINXPYTHON environment 
variable to point to the Python executable having Sphinx installed. If you 
don't have Sphinx installed, grab it from http://sphinx-doc.org/.  Stop.
```

**`SPHINXBUILD` set:**

```
$ SPHINXBUILD=sphinx-build make html
sphinx-build -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

**Both `SPHINXPYTHON` and `SPHINXBUILD` are set:**

```
$ SPHINXBUILD=sphinx-build SPHINXPYTHON=python make html
sphinx-build -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

Author: hyukjinkwon 

Closes #21659 from HyukjinKwon/SPARK-24530.

(cherry picked from commit 1f94bf492c3bce3b61f7fec6132b50e06dea94a8)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/19542f5d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/19542f5d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/19542f5d

Branch: refs/heads/branch-2.3
Commit: 19542f5de390f6096745e478dd9339958db899e8
Parents: 72eb97c
Author: hyukjinkwon 
Authored: Wed Jul 11 10:10:07 2018 +0800
Committer: hyukjinkwon 
Committed: Wed Jul 11 10:10:29 2018 +0800

--
 python/docs/Makefile | 37 +++--
 1 file changed, 31 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/19542f5d/python/docs/Makefile
--
diff --git a/python/docs/Makefile b/python/docs/Makefile

spark git commit: [SPARK-24530][PYTHON] Add a control to force Python version in Sphinx via environment variable, SPHINXPYTHON

2018-07-10 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6078b891d -> 1f94bf492


[SPARK-24530][PYTHON] Add a control to force Python version in Sphinx via 
environment variable, SPHINXPYTHON

## What changes were proposed in this pull request?

This PR proposes to add a `SPHINXPYTHON` environment variable to control the 
Python version used by Sphinx.

The motivation for this environment variable is that some signatures in the Python 
documentation are not rendered properly when Sphinx runs under Python 2; see the 
JIRA for an example. Using Python 3 should be encouraged, but it looks like we will 
live with this problem for a long while in any event.

For the default case of `make html`, it keeps the previous behaviour and uses 
`SPHINXBUILD` as before. If `SPHINXPYTHON` is set, it forces Sphinx to use that 
specific Python version.

```
$ SPHINXPYTHON=python3 make html
python3 -msphinx -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

1. If `SPHINXPYTHON` is set, use that Python executable. If `SPHINXBUILD` is set, 
use sphinx-build.
2. If both are set, `SPHINXBUILD` takes priority over `SPHINXPYTHON`.
3. By default, `SPHINXBUILD` is used as 'sphinx-build'.

We could probably work around this by explicitly setting `SPHINXBUILD`, but the 
`sphinx-build` executable can't easily be distinguished by Python version: at least 
in my environment, and as far as I know, installing a newer Sphinx under a different 
Python version does not replace `sphinx-build`, and the command gives no indication 
or warning about which Python it uses.

## How was this patch tested?

Manually tested:

**`python` (Python 2.7) in the path with Sphinx:**

```
$ make html
sphinx-build -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

**`python` (Python 2.7) in the path without Sphinx:**

```
$ make html
Makefile:8: *** The 'sphinx-build' command was not found. Make sure you have 
Sphinx installed, then set the SPHINXBUILD environment variable to point to the 
full path of the 'sphinx-build' executable. Alternatively you can add the 
directory with the executable to your PATH. If you don't have Sphinx installed, 
grab it from http://sphinx-doc.org/.  Stop.
```

**`SPHINXPYTHON` set `python` (Python 2.7)  with Sphinx:**

```
$ SPHINXPYTHON=python make html
Makefile:35: *** Note that Python 3 is required to generate PySpark 
documentation correctly for now. Current Python executable was less than Python 
3. See SPARK-24530. To force Sphinx to use a specific Python executable, please 
set SPHINXPYTHON to point to the Python 3 executable..  Stop.
```

**`SPHINXPYTHON` set `python` (Python 2.7)  without Sphinx:**

```
$ SPHINXPYTHON=python make html
Makefile:35: *** Note that Python 3 is required to generate PySpark 
documentation correctly for now. Current Python executable was less than Python 
3. See SPARK-24530. To force Sphinx to use a specific Python executable, please 
set SPHINXPYTHON to point to the Python 3 executable..  Stop.
```

**`SPHINXPYTHON` set `python3` with Sphinx:**

```
$ SPHINXPYTHON=python3 make html
python3 -msphinx -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

**`SPHINXPYTHON` set `python3` without Sphinx:**

```
$ SPHINXPYTHON=python3 make html
Makefile:39: *** Python executable 'python3' did not have Sphinx installed. 
Make sure you have Sphinx installed, then set the SPHINXPYTHON environment 
variable to point to the Python executable having Sphinx installed. If you 
don't have Sphinx installed, grab it from http://sphinx-doc.org/.  Stop.
```

**`SPHINXBUILD` set:**

```
$ SPHINXBUILD=sphinx-build make html
sphinx-build -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

**Both `SPHINXPYTHON` and `SPHINXBUILD` are set:**

```
$ SPHINXBUILD=sphinx-build SPHINXPYTHON=python make html
sphinx-build -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

Author: hyukjinkwon 

Closes #21659 from HyukjinKwon/SPARK-24530.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1f94bf49
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1f94bf49
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1f94bf49

Branch: refs/heads/master
Commit: 1f94bf492c3bce3b61f7fec6132b50e06dea94a8
Parents: 6078b89
Author: hyukjinkwon 
Authored: Wed Jul 11 10:10:07 2018 +0800
Committer: hyukjinkwon 
Committed: Wed Jul 11 10:10:07 2018 +0800

--
 python/docs/Makefile | 37 +++--
 1 file changed, 31 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1f94bf49/python/docs/Makefile
--
diff --git a/python/docs/Makefile b/python/docs/Makefile
index b8e0794..1ed1f33 100644
--- a/python/docs/Makefile
+++ b/python/docs/Makefile
@@ -1,19 +1,44 @@
 # 

spark git commit: [SPARK-24730][SS] Add policy to choose max as global watermark when streaming query has multiple watermarks

2018-07-10 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/master 32cb50835 -> 6078b891d


[SPARK-24730][SS] Add policy to choose max as global watermark when streaming 
query has multiple watermarks

## What changes were proposed in this pull request?

Currently, when a streaming query has multiple watermarks, the policy is to choose 
the minimum of them as the global watermark. This is safe because the global 
watermark moves with the slowest stream and therefore does not unexpectedly drop 
data as late. While this is indeed the safe thing to do, in some cases you may want 
the watermark to advance with the fastest stream, that is, take the max of the 
multiple watermarks. This PR adds that configuration. It makes the following changes.

- Adds a configuration to specify max as the policy.
- Saves the configuration in OffsetSeqMetadata because changing it in the middle 
can lead to unpredictable results.
   - For old checkpoints without the configuration, it assumes the default policy 
as min (irrespective of the policy set at the session where the query is being 
restarted). This is to ensure that existing queries are not affected in any way.
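
For illustration, a sketch of opting into the new policy (assuming a SparkSession 
`spark`; the rate source and the union are just one way to end up with multiple 
watermark operators in one query):

```
// Sketch only: two watermarked inputs unioned into a single query. With the policy
// below, the global watermark follows the faster of the two streams. The setting
// cannot be changed between restarts from the same checkpoint location.
import spark.implicits._
import org.apache.spark.sql.functions.window

spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")

val s1 = spark.readStream.format("rate").load().withWatermark("timestamp", "1 hour")
val s2 = spark.readStream.format("rate").load().withWatermark("timestamp", "10 minutes")

val counts = s1.union(s2).groupBy(window($"timestamp", "5 minutes")).count()
```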

TODO
- [ ] Add a test for recovery from existing checkpoints.

## How was this patch tested?
New unit test

Author: Tathagata Das 

Closes #21701 from tdas/SPARK-24730.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6078b891
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6078b891
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6078b891

Branch: refs/heads/master
Commit: 6078b891da8fe7fc36579699473168ae7443284c
Parents: 32cb508
Author: Tathagata Das 
Authored: Tue Jul 10 18:03:40 2018 -0700
Committer: Tathagata Das 
Committed: Tue Jul 10 18:03:40 2018 -0700

--
 .../org/apache/spark/sql/internal/SQLConf.scala |  15 ++
 .../streaming/MicroBatchExecution.scala |   4 +-
 .../sql/execution/streaming/OffsetSeq.scala |  37 -
 .../execution/streaming/WatermarkTracker.scala  |  90 ++--
 .../commits/0   |   2 +
 .../commits/1   |   2 +
 .../metadata|   1 +
 .../offsets/0   |   4 +
 .../offsets/1   |   4 +
 .../sql/streaming/EventTimeWatermarkSuite.scala | 136 ++-
 10 files changed, 276 insertions(+), 19 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6078b891/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 50965c1..ae56cc9 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -875,6 +875,21 @@ object SQLConf {
   .stringConf
   
.createWithDefault("org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol")
 
+  val STREAMING_MULTIPLE_WATERMARK_POLICY =
+buildConf("spark.sql.streaming.multipleWatermarkPolicy")
+  .doc("Policy to calculate the global watermark value when there are 
multiple watermark " +
+"operators in a streaming query. The default value is 'min' which 
chooses " +
+"the minimum watermark reported across multiple operators. Other 
alternative value is" +
+"'max' which chooses the maximum across multiple operators." +
+"Note: This configuration cannot be changed between query restarts 
from the same " +
+"checkpoint location.")
+  .stringConf
+  .checkValue(
+str => Set("min", "max").contains(str.toLowerCase),
+"Invalid value for 'spark.sql.streaming.multipleWatermarkPolicy'. " +
+  "Valid values are 'min' and 'max'")
+  .createWithDefault("min") // must be same as 
MultipleWatermarkPolicy.DEFAULT_POLICY_NAME
+
   val OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD =
 buildConf("spark.sql.objectHashAggregate.sortBased.fallbackThreshold")
   .internal()

http://git-wip-us.apache.org/repos/asf/spark/blob/6078b891/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
index 17ffa2a..16651dd 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
+++ 

svn commit: r28045 - in /dev/spark/2.4.0-SNAPSHOT-2018_07_10_12_01-32cb508-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-07-10 Thread pwendell
Author: pwendell
Date: Tue Jul 10 19:16:13 2018
New Revision: 28045

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_07_10_12_01-32cb508 docs


[This commit notification would consist of 1467 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




spark git commit: [SPARK-24662][SQL][SS] Support limit in structured streaming

2018-07-10 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/master e0559f238 -> 32cb50835


[SPARK-24662][SQL][SS] Support limit in structured streaming

## What changes were proposed in this pull request?

Support the LIMIT operator in structured streaming.

For streams in append or complete output mode, a stream with a LIMIT operator 
will return no more than the specified number of rows. LIMIT is still 
unsupported for the update output mode.

As part of this change, 
https://github.com/apache/spark/commit/e4fee395ecd93ad4579d9afbf0861f82a303e563 
is reverted, because this is a better and more complete implementation.
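
A sketch of the now-supported pattern (assuming a SparkSession `spark`; the rate 
source and memory sink are only for illustration):

```
// Sketch only: LIMIT on a streaming Dataset written out in append mode. In update
// output mode the query still fails analysis, as before.
val limited = spark.readStream.format("rate").load().limit(5)

val query = limited.writeStream
  .format("memory")
  .queryName("limited_rows")
  .outputMode("append")
  .start()
```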

## How was this patch tested?

New and existing unit tests.

Author: Mukul Murthy 

Closes #21662 from mukulmurthy/SPARK-24662.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/32cb5083
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/32cb5083
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/32cb5083

Branch: refs/heads/master
Commit: 32cb50835e7258625afff562939872be002232f2
Parents: e0559f2
Author: Mukul Murthy 
Authored: Tue Jul 10 11:08:04 2018 -0700
Committer: Tathagata Das 
Committed: Tue Jul 10 11:08:04 2018 -0700

--
 .../analysis/UnsupportedOperationChecker.scala  |   6 +-
 .../spark/sql/execution/SparkStrategies.scala   |  26 -
 .../streaming/IncrementalExecution.scala|  11 +-
 .../streaming/StreamingGlobalLimitExec.scala| 102 ++
 .../spark/sql/execution/streaming/memory.scala  |  70 ++--
 .../execution/streaming/sources/memoryV2.scala  |  44 ++--
 .../spark/sql/streaming/DataStreamWriter.scala  |   4 +-
 .../execution/streaming/MemorySinkSuite.scala   |  62 +--
 .../execution/streaming/MemorySinkV2Suite.scala |  80 +-
 .../spark/sql/streaming/StreamSuite.scala   | 108 +++
 .../apache/spark/sql/streaming/StreamTest.scala |   4 +-
 11 files changed, 272 insertions(+), 245 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/32cb5083/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala
index 5ced1ca..f68df5d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala
@@ -315,8 +315,10 @@ object UnsupportedOperationChecker {
 case GroupingSets(_, _, child, _) if child.isStreaming =>
   throwError("GroupingSets is not supported on streaming 
DataFrames/Datasets")
 
-case GlobalLimit(_, _) | LocalLimit(_, _) if 
subPlan.children.forall(_.isStreaming) =>
-  throwError("Limits are not supported on streaming 
DataFrames/Datasets")
+case GlobalLimit(_, _) | LocalLimit(_, _)
+if subPlan.children.forall(_.isStreaming) && outputMode == 
InternalOutputModes.Update =>
+  throwError("Limits are not supported on streaming 
DataFrames/Datasets in Update " +
+"output mode")
 
 case Sort(_, _, _) if !containsCompleteData(subPlan) =>
   throwError("Sorting is not supported on streaming 
DataFrames/Datasets, unless it is on " +

http://git-wip-us.apache.org/repos/asf/spark/blob/32cb5083/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
index cfbcb9a..02e095b 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
@@ -27,6 +27,7 @@ import org.apache.spark.sql.catalyst.planning._
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.catalyst.streaming.InternalOutputModes
 import org.apache.spark.sql.execution.columnar.{InMemoryRelation, 
InMemoryTableScanExec}
 import org.apache.spark.sql.execution.command._
 import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
@@ -34,7 +35,7 @@ import org.apache.spark.sql.execution.joins.{BuildLeft, 
BuildRight, BuildSide}
 import org.apache.spark.sql.execution.streaming._
 import org.apache.spark.sql.execution.streaming.sources.MemoryPlanV2
 import 

svn commit: r28035 - in /dev/spark/2.4.0-SNAPSHOT-2018_07_10_08_01-6fe3286-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-07-10 Thread pwendell
Author: pwendell
Date: Tue Jul 10 15:16:05 2018
New Revision: 28035

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_07_10_08_01-6fe3286 docs


[This commit notification would consist of 1467 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




spark git commit: [SPARK-21743][SQL][FOLLOWUP] free aggregate map when task ends

2018-07-10 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 6fe32869c -> e0559f238


[SPARK-21743][SQL][FOLLOWUP] free aggregate map when task ends

## What changes were proposed in this pull request?

This is the first follow-up of https://github.com/apache/spark/pull/21573 , 
which was only merged to 2.3.

This PR fixes the memory leak in another way: free the `UnsafeExternalMap` when the 
task ends. All the data buffers in Spark SQL use `UnsafeExternalMap` and 
`UnsafeExternalSorter` under the hood, e.g. sort, aggregate, window, SMJ, etc. 
`UnsafeExternalSorter` registers a task completion listener to free its resources; 
we should apply the same to `UnsafeExternalMap`.

TODO in the next PR:
do not consume all the input when there is a limit in whole-stage codegen.
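
The pattern the PR applies, as a standalone sketch (not the PR's exact code; the 
real change lives in the constructor of `UnsafeFixedWidthAggregationMap`):

```
// Sketch only: register a task completion listener so the hash map's memory pages are
// freed when the task ends, even if a downstream operator stops consuming early.
import org.apache.spark.TaskContext
import org.apache.spark.unsafe.map.BytesToBytesMap
import org.apache.spark.util.TaskCompletionListener

def registerCleanup(map: BytesToBytesMap): Unit = {
  TaskContext.get().addTaskCompletionListener(new TaskCompletionListener {
    override def onTaskCompletion(context: TaskContext): Unit = {
      map.free()  // releases all memory pages held by the map
    }
  })
}
```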

## How was this patch tested?

existing tests

Author: Wenchen Fan 

Closes #21738 from cloud-fan/limit.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e0559f23
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e0559f23
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e0559f23

Branch: refs/heads/master
Commit: e0559f238009e02c40f65678fec691c07904e8c0
Parents: 6fe3286
Author: Wenchen Fan 
Authored: Tue Jul 10 23:07:10 2018 +0800
Committer: Wenchen Fan 
Committed: Tue Jul 10 23:07:10 2018 +0800

--
 .../UnsafeFixedWidthAggregationMap.java | 17 +++-
 .../spark/sql/execution/SparkStrategies.scala   |  7 +--
 .../execution/aggregate/HashAggregateExec.scala |  2 +-
 .../aggregate/TungstenAggregationIterator.scala |  2 +-
 .../UnsafeFixedWidthAggregationMapSuite.scala   | 21 
 5 files changed, 28 insertions(+), 21 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e0559f23/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMap.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMap.java
 
b/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMap.java
index c7c4c7b..c8cf44b 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMap.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMap.java
@@ -20,8 +20,8 @@ package org.apache.spark.sql.execution;
 import java.io.IOException;
 
 import org.apache.spark.SparkEnv;
+import org.apache.spark.TaskContext;
 import org.apache.spark.internal.config.package$;
-import org.apache.spark.memory.TaskMemoryManager;
 import org.apache.spark.sql.catalyst.InternalRow;
 import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
 import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
@@ -82,7 +82,7 @@ public final class UnsafeFixedWidthAggregationMap {
* @param emptyAggregationBuffer the default value for new keys (a "zero" of 
the agg. function)
* @param aggregationBufferSchema the schema of the aggregation buffer, used 
for row conversion.
* @param groupingKeySchema the schema of the grouping key, used for row 
conversion.
-   * @param taskMemoryManager the memory manager used to allocate our Unsafe 
memory structures.
+   * @param taskContext the current task context.
* @param initialCapacity the initial capacity of the map (a sizing hint to 
avoid re-hashing).
* @param pageSizeBytes the data page size, in bytes; limits the maximum 
record size.
*/
@@ -90,19 +90,26 @@ public final class UnsafeFixedWidthAggregationMap {
   InternalRow emptyAggregationBuffer,
   StructType aggregationBufferSchema,
   StructType groupingKeySchema,
-  TaskMemoryManager taskMemoryManager,
+  TaskContext taskContext,
   int initialCapacity,
   long pageSizeBytes) {
 this.aggregationBufferSchema = aggregationBufferSchema;
 this.currentAggregationBuffer = new 
UnsafeRow(aggregationBufferSchema.length());
 this.groupingKeyProjection = UnsafeProjection.create(groupingKeySchema);
 this.groupingKeySchema = groupingKeySchema;
-this.map =
-  new BytesToBytesMap(taskMemoryManager, initialCapacity, pageSizeBytes, 
true);
+this.map = new BytesToBytesMap(
+  taskContext.taskMemoryManager(), initialCapacity, pageSizeBytes, true);
 
 // Initialize the buffer for aggregation value
 final UnsafeProjection valueProjection = 
UnsafeProjection.create(aggregationBufferSchema);
 this.emptyAggregationBuffer = 
valueProjection.apply(emptyAggregationBuffer).getBytes();
+
+// Register a cleanup task with TaskContext to ensure that memory is 
guaranteed to be freed at
+// the end of the task. This is necessary to avoid memory leaks in when 
the downstream operator
+// does not fully consume the aggregation 

spark git commit: [SPARK-24678][SPARK-STREAMING] Give priority in use of 'PROCESS_LOCAL' for spark-streaming

2018-07-10 Thread jshao
Repository: spark
Updated Branches:
  refs/heads/master a28900956 -> 6fe32869c


[SPARK-24678][SPARK-STREAMING] Give priority in use of 'PROCESS_LOCAL' for 
spark-streaming

## What changes were proposed in this pull request?

Currently, `BlockRDD.getPreferredLocations` only returns the host info of blocks, 
so the subsequent scheduling level is never better than 'NODE_LOCAL'. With a small 
change, the scheduling level can be improved to 'PROCESS_LOCAL'.
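
For illustration, a sketch of the location string this change produces. Note that 
`ExecutorCacheTaskLocation` is `private[spark]`, so this only compiles inside Spark's 
own packages, and the exact string format shown in the comment is my assumption.

```
// Sketch only: a preferred location naming both the host and the executor that caches
// the block, which the scheduler can resolve to PROCESS_LOCAL instead of NODE_LOCAL.
import org.apache.spark.scheduler.ExecutorCacheTaskLocation

val loc = ExecutorCacheTaskLocation(host = "host-1", executorId = "3")
println(loc.toString)  // expected to look like "executor_host-1_3"
```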

## How was this patch tested?

manual test

Author: sharkdtu 

Closes #21658 from sharkdtu/master.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6fe32869
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6fe32869
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6fe32869

Branch: refs/heads/master
Commit: 6fe32869ccb17933e77a4dbe883e36d382fbeeec
Parents: a289009
Author: sharkdtu 
Authored: Tue Jul 10 20:18:34 2018 +0800
Committer: jerryshao 
Committed: Tue Jul 10 20:18:34 2018 +0800

--
 .../src/main/scala/org/apache/spark/rdd/BlockRDD.scala |  2 +-
 .../scala/org/apache/spark/storage/BlockManager.scala  |  7 +--
 .../org/apache/spark/storage/BlockManagerSuite.scala   | 13 +
 3 files changed, 19 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6fe32869/core/src/main/scala/org/apache/spark/rdd/BlockRDD.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rdd/BlockRDD.scala 
b/core/src/main/scala/org/apache/spark/rdd/BlockRDD.scala
index 4e036c2..23cf19d 100644
--- a/core/src/main/scala/org/apache/spark/rdd/BlockRDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/BlockRDD.scala
@@ -30,7 +30,7 @@ private[spark]
 class BlockRDD[T: ClassTag](sc: SparkContext, @transient val blockIds: 
Array[BlockId])
   extends RDD[T](sc, Nil) {
 
-  @transient lazy val _locations = BlockManager.blockIdsToHosts(blockIds, 
SparkEnv.get)
+  @transient lazy val _locations = BlockManager.blockIdsToLocations(blockIds, 
SparkEnv.get)
   @volatile private var _isValid = true
 
   override def getPartitions: Array[Partition] = {

http://git-wip-us.apache.org/repos/asf/spark/blob/6fe32869/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
--
diff --git a/core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
index df1a4be..0e1c7d5 100644
--- a/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
+++ b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
@@ -45,6 +45,7 @@ import org.apache.spark.network.netty.SparkTransportConf
 import org.apache.spark.network.shuffle.{ExternalShuffleClient, 
TempFileManager}
 import org.apache.spark.network.shuffle.protocol.ExecutorShuffleInfo
 import org.apache.spark.rpc.RpcEnv
+import org.apache.spark.scheduler.ExecutorCacheTaskLocation
 import org.apache.spark.serializer.{SerializerInstance, SerializerManager}
 import org.apache.spark.shuffle.ShuffleManager
 import org.apache.spark.storage.memory._
@@ -1554,7 +1555,7 @@ private[spark] class BlockManager(
 private[spark] object BlockManager {
   private val ID_GENERATOR = new IdGenerator
 
-  def blockIdsToHosts(
+  def blockIdsToLocations(
   blockIds: Array[BlockId],
   env: SparkEnv,
   blockManagerMaster: BlockManagerMaster = null): Map[BlockId, 
Seq[String]] = {
@@ -1569,7 +1570,9 @@ private[spark] object BlockManager {
 
 val blockManagers = new HashMap[BlockId, Seq[String]]
 for (i <- 0 until blockIds.length) {
-  blockManagers(blockIds(i)) = blockLocations(i).map(_.host)
+  blockManagers(blockIds(i)) = blockLocations(i).map { loc =>
+ExecutorCacheTaskLocation(loc.host, loc.executorId).toString
+  }
 }
 blockManagers.toMap
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/6fe32869/core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala 
b/core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala
index b19d8eb..08172f0 100644
--- a/core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala
@@ -1422,6 +1422,19 @@ class BlockManagerSuite extends SparkFunSuite with 
Matchers with BeforeAndAfterE
 assert(mockBlockTransferService.tempFileManager === 
store.remoteBlockTempFileManager)
   }
 
+  test("query locations of blockIds") {
+val mockBlockManagerMaster = mock(classOf[BlockManagerMaster])
+val blockLocations = Seq(BlockManagerId("1", 

svn commit: r28023 - in /dev/spark/2.4.0-SNAPSHOT-2018_07_10_04_01-a289009-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-07-10 Thread pwendell
Author: pwendell
Date: Tue Jul 10 11:17:52 2018
New Revision: 28023

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_07_10_04_01-a289009 docs


[This commit notification would consist of 1467 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




spark git commit: [SPARK-24706][SQL] ByteType and ShortType support pushdown to parquet

2018-07-10 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 4984f1af7 -> a28900956


[SPARK-24706][SQL] ByteType and ShortType support pushdown to parquet

## What changes were proposed in this pull request?

`ByteType` and `ShortType` now support filter pushdown to the Parquet data source.
[Benchmark 
result](https://issues.apache.org/jira/browse/SPARK-24706?focusedCommentId=16528878=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16528878).
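
A sketch of the kind of filter that can now be pushed down (assuming a SparkSession 
`spark` and a Parquet table with a tinyint column; the path and column name are 
illustrative):

```
// Sketch only: comparisons against ByteType/ShortType columns can now be pushed to
// the Parquet reader instead of being evaluated only after the scan.
import spark.implicits._

val df = spark.read.parquet("/path/to/tinyint_table")
df.filter($"value" < 12.toByte).count()
```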

## How was this patch tested?

unit tests

Author: Yuming Wang 

Closes #21682 from wangyum/SPARK-24706.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a2890095
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a2890095
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a2890095

Branch: refs/heads/master
Commit: a289009567c1566a1df4bcdfdf0111e82ae3d81d
Parents: 4984f1a
Author: Yuming Wang 
Authored: Tue Jul 10 15:58:14 2018 +0800
Committer: Wenchen Fan 
Committed: Tue Jul 10 15:58:14 2018 +0800

--
 .../FilterPushdownBenchmark-results.txt | 32 +--
 .../datasources/parquet/ParquetFilters.scala| 34 +++-
 .../parquet/ParquetFilterSuite.scala| 56 
 3 files changed, 94 insertions(+), 28 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a2890095/sql/core/benchmarks/FilterPushdownBenchmark-results.txt
--
diff --git a/sql/core/benchmarks/FilterPushdownBenchmark-results.txt 
b/sql/core/benchmarks/FilterPushdownBenchmark-results.txt
index 29fe434..110669b 100644
--- a/sql/core/benchmarks/FilterPushdownBenchmark-results.txt
+++ b/sql/core/benchmarks/FilterPushdownBenchmark-results.txt
@@ -542,39 +542,39 @@ Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
 
 Select 1 tinyint row (value = CAST(63 AS tinyint)): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                            3726 / 3775          4.2         236.9       1.0X
-Parquet Vectorized (Pushdown)                 3741 / 3789          4.2         237.9       1.0X
-Native ORC Vectorized                         2793 / 2909          5.6         177.6       1.3X
-Native ORC Vectorized (Pushdown)               530 /  561         29.7          33.7       7.0X
+Parquet Vectorized                            3461 / 3997          4.5         220.1       1.0X
+Parquet Vectorized (Pushdown)                  270 /  315         58.4          17.1      12.8X
+Native ORC Vectorized                         4107 / 5372          3.8         261.1       0.8X
+Native ORC Vectorized (Pushdown)               778 / 1553         20.2          49.5       4.4X
 
 Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
 
 Select 10% tinyint rows (value < CAST(12 AS tinyint)): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                            4385 / 4406          3.6         278.8       1.0X
-Parquet Vectorized (Pushdown)                 4398 / 4454          3.6         279.6       1.0X
-Native ORC Vectorized                         3420 / 3501          4.6         217.4       1.3X
-Native ORC Vectorized (Pushdown)              1395 / 1432         11.3          88.7       3.1X
+Parquet Vectorized                            4771 / 6655          3.3         303.3       1.0X
+Parquet Vectorized (Pushdown)                 1322 / 1606         11.9          84.0       3.6X
+Native ORC Vectorized                         4437 / 4572          3.5         282.1       1.1X
+Native ORC Vectorized (Pushdown)              1781 / 1976          8.8         113.2       2.7X
 
 Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
 
 Select 50% tinyint rows (value < CAST(63 AS tinyint)): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7307 / 7394          2.2         464.6       1.0X
-Parquet Vectorized (Pushdown)                 7411 / 7461          2.1         471.2       1.0X
-Native ORC Vectorized                         6501 / 7814          2.4         413.4       1.1X
-Native ORC Vectorized (Pushdown)              7341 / 8637          2.1         466.7       1.0X
+Parquet Vectorized                            7433 / 7752          2.1         472.6       1.0X
+Parquet Vectorized (Pushdown)                 5863 / 5913          2.7         372.8

svn commit: r28019 - in /dev/spark/2.4.0-SNAPSHOT-2018_07_10_00_01-4984f1a-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-07-10 Thread pwendell
Author: pwendell
Date: Tue Jul 10 07:16:39 2018
New Revision: 28019

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_07_10_00_01-4984f1a docs


[This commit notification would consist of 1467 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]
