spark git commit: [SPARK-22514][SQL] move ColumnVector.Array and ColumnarBatch.Row to individual files

2017-11-15 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 1e6f76059 -> dce1610ae


[SPARK-22514][SQL] move ColumnVector.Array and ColumnarBatch.Row to individual 
files

## What changes were proposed in this pull request?

Logically, `Array` doesn't belong to `ColumnVector`, and `Row` doesn't belong to 
`ColumnarBatch`: for example, `ColumnVector` returns an `Array` from `getArray` 
and a `Row` from `getStruct`, and `Array` and `Row` can return each other via 
their own `getArray`/`getStruct` methods.

This is also a step toward making `ColumnVector` public: it's cleaner to have 
`Array` and `Row` as top-level classes.

This PR only moves code around, with two renames: `ColumnVector.Array` -> 
`ColumnarArray` and `ColumnarBatch.Row` -> `ColumnarRow`.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan 

Closes #19740 from cloud-fan/vector.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dce1610a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dce1610a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dce1610a

Branch: refs/heads/master
Commit: dce1610ae376af00712ba7f4c99bfb4c006dbaec
Parents: 1e6f760
Author: Wenchen Fan 
Authored: Wed Nov 15 14:42:37 2017 +0100
Committer: Wenchen Fan 
Committed: Wed Nov 15 14:42:37 2017 +0100

--
 .../execution/vectorized/AggregateHashMap.java  |   2 +-
 .../execution/vectorized/ArrowColumnVector.java |   6 +-
 .../sql/execution/vectorized/ColumnVector.java  | 202 +---
 .../execution/vectorized/ColumnVectorUtils.java |   2 +-
 .../sql/execution/vectorized/ColumnarArray.java | 208 
 .../sql/execution/vectorized/ColumnarBatch.java | 326 +-
 .../sql/execution/vectorized/ColumnarRow.java   | 327 +++
 .../vectorized/OffHeapColumnVector.java |   2 +-
 .../vectorized/OnHeapColumnVector.java  |   2 +-
 .../vectorized/WritableColumnVector.java|  14 +-
 .../execution/aggregate/HashAggregateExec.scala |  10 +-
 .../aggregate/VectorizedHashMapGenerator.scala  |  12 +-
 .../vectorized/ColumnVectorSuite.scala  |  40 +--
 13 files changed, 597 insertions(+), 556 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/dce1610a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/AggregateHashMap.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/AggregateHashMap.java
 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/AggregateHashMap.java
index cb3ad4e..9467435 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/AggregateHashMap.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/AggregateHashMap.java
@@ -72,7 +72,7 @@ public class AggregateHashMap {
 this(schema, DEFAULT_CAPACITY, DEFAULT_LOAD_FACTOR, DEFAULT_MAX_STEPS);
   }
 
-  public ColumnarBatch.Row findOrInsert(long key) {
+  public ColumnarRow findOrInsert(long key) {
 int idx = find(key);
 if (idx != -1 && buckets[idx] == -1) {
   columnVectors[0].putLong(numRows, key);

http://git-wip-us.apache.org/repos/asf/spark/blob/dce1610a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java
 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java
index 51ea719..949035b 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java
@@ -251,7 +251,7 @@ public final class ArrowColumnVector extends ColumnVector {
   }
 
   @Override
-  public void loadBytes(ColumnVector.Array array) {
+  public void loadBytes(ColumnarArray array) {
 throw new UnsupportedOperationException();
   }
 
@@ -330,7 +330,7 @@ public final class ArrowColumnVector extends ColumnVector {
 
   childColumns = new ArrowColumnVector[1];
   childColumns[0] = new ArrowColumnVector(listVector.getDataVector());
-  resultArray = new ColumnVector.Array(childColumns[0]);
+  resultArray = new ColumnarArray(childColumns[0]);
 } else if (vector instanceof MapVector) {
   MapVector mapVector = (MapVector) vector;
   accessor = new StructAccessor(mapVector);
@@ -339,7 +339,7 @@ public final class ArrowColumnVector extends ColumnVector {
   for (int i = 0; i < childColumns.length; ++i) {
 childColumns[i] = new ArrowColumnVector(mapVector.getVectorById(i));
   }
-  resultStruct = new ColumnarBatch.Row(childColumns);
+  resultStruct = new ColumnarRow(childColumns);
 } else

spark git commit: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column names in createDataFrame with Arrow

2017-11-15 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master dce1610ae -> 8f0e88df0


[SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column names in 
createDataFrame with Arrow

## What changes were proposed in this pull request?

If the schema is passed as a list of unicode strings for column names, they 
should be re-encoded to 'utf-8' to be consistent. This is similar to #13097, 
but for DataFrame creation using Arrow.
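
A minimal sketch of the re-encoding (plain Python, outside Spark): under Python 2, 
`unicode` column names are encoded to UTF-8 byte strings so they match 
`StructField` names, while under Python 3 the names are already `str` and are 
left untouched.

```
# Mirrors the list comprehension added to session.py in the diff below.
schema = [u'a', u'b', 'c']
schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema]
print(schema)
# Python 2: ['a', 'b', 'c']  (all byte strings after encoding)
# Python 3: ['a', 'b', 'c']  (u'...' literals are already str, nothing changes)
```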

## How was this patch tested?

Added a new test that uses unicode column names for the schema.

Author: Bryan Cutler 

Closes #19738 from 
BryanCutler/arrow-createDataFrame-followup-unicode-SPARK-20791.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8f0e88df
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8f0e88df
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8f0e88df

Branch: refs/heads/master
Commit: 8f0e88df03a06a91bb61c6e0d69b1b19e2bfb3f7
Parents: dce1610
Author: Bryan Cutler 
Authored: Wed Nov 15 23:35:13 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Nov 15 23:35:13 2017 +0900

--
 python/pyspark/sql/session.py |  7 ---
 python/pyspark/sql/tests.py   | 10 ++
 2 files changed, 14 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8f0e88df/python/pyspark/sql/session.py
--
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 589365b..dbbcfff 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -592,6 +592,9 @@ class SparkSession(object):
 
 if isinstance(schema, basestring):
 schema = _parse_datatype_string(schema)
+elif isinstance(schema, (list, tuple)):
+# Must re-encode any unicode strings to be consistent with 
StructField names
+schema = [x.encode('utf-8') if not isinstance(x, str) else x for x 
in schema]
 
 try:
 import pandas
@@ -602,7 +605,7 @@ class SparkSession(object):
 
 # If no schema supplied by user then get the names of columns only
 if schema is None:
-schema = [str(x) for x in data.columns]
+schema = [x.encode('utf-8') if not isinstance(x, str) else x 
for x in data.columns]
 
 if self.conf.get("spark.sql.execution.arrow.enabled", 
"false").lower() == "true" \
 and len(data) > 0:
@@ -630,8 +633,6 @@ class SparkSession(object):
 verify_func(obj)
 return obj,
 else:
-if isinstance(schema, (list, tuple)):
-schema = [x.encode('utf-8') if not isinstance(x, str) else x 
for x in schema]
 prepare = lambda obj: obj
 
 if isinstance(data, RDD):

http://git-wip-us.apache.org/repos/asf/spark/blob/8f0e88df/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 6356d93..ef592c2 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3225,6 +3225,16 @@ class ArrowTests(ReusedSQLTestCase):
 df = self.spark.createDataFrame(pdf, schema=tuple('abcdefg'))
 self.assertEquals(df.schema.fieldNames(), list('abcdefg'))
 
+def test_createDataFrame_column_name_encoding(self):
+import pandas as pd
+pdf = pd.DataFrame({u'a': [1]})
+columns = self.spark.createDataFrame(pdf).columns
+self.assertTrue(isinstance(columns[0], str))
+self.assertEquals(columns[0], 'a')
+columns = self.spark.createDataFrame(pdf, [u'b']).columns
+self.assertTrue(isinstance(columns[0], str))
+self.assertEquals(columns[0], 'b')
+
 def test_createDataFrame_with_single_data_type(self):
 import pandas as pd
 with QuietTest(self.sc):





spark-website git commit: And update the site directory

2017-11-15 Thread holden
Repository: spark-website
Updated Branches:
  refs/heads/asf-site 853627da6 -> d8159e6c0


And update the site directory


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/d8159e6c
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/d8159e6c
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/d8159e6c

Branch: refs/heads/asf-site
Commit: d8159e6c08e1684d52f86cab3b0840ae4f3b1d6f
Parents: 853627d
Author: Holden Karau 
Authored: Wed Nov 15 06:56:04 2017 -0800
Committer: Holden Karau 
Committed: Wed Nov 15 07:14:05 2017 -0800

--
 site/js/downloads.js | 8 
 1 file changed, 8 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/d8159e6c/site/js/downloads.js
--
diff --git a/site/js/downloads.js b/site/js/downloads.js
index 675d24c..a909362 100644
--- a/site/js/downloads.js
+++ b/site/js/downloads.js
@@ -99,6 +99,14 @@ function initReleaseNotes() {
   }
 }
 
+function onPackageSelect() {
+  var packageSelect = document.getElementById("sparkPackageSelect");
+
+  var pkg = getSelectedValue(packageSelect);
+
+  updateDownloadLink();
+}
+
 function onVersionSelect() {
   var versionSelect = document.getElementById("sparkVersionSelect");
   var packageSelect = document.getElementById("sparkPackageSelect");





spark-website git commit: Reduce generated HTML differences

2017-11-15 Thread srowen
Repository: spark-website
Updated Branches:
  refs/heads/asf-site d8159e6c0 -> de3e0a792


Reduce generated HTML differences


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/de3e0a79
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/de3e0a79
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/de3e0a79

Branch: refs/heads/asf-site
Commit: de3e0a7920f1da9e35acc38000ab0ae8c04b
Parents: d8159e6
Author: hyukjinkwon 
Authored: Sat Oct 28 19:20:48 2017 +0900
Committer: Sean Owen 
Committed: Wed Nov 15 09:53:19 2017 -0600

--
 committers.md  | 3 +++
 contributing.md| 1 +
 security.md| 6 ++
 site/committers.html   | 3 +++
 site/contributing.html | 1 +
 site/security.html | 6 ++
 6 files changed, 20 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/de3e0a79/committers.md
--
diff --git a/committers.md b/committers.md
index 0c3b40f..90d09ac 100644
--- a/committers.md
+++ b/committers.md
@@ -156,6 +156,7 @@ The trade off when backporting is you get to deliver the 
fix to people running o
 The decision point is when you have a bug fix and it's not clear whether it is 
worth backporting.
 
 I think the following facets are important to consider:
+
 - Backports are an extremely valuable service to the community and should be 
considered for 
 any bug fix.
 - Introducing a new bug in a maintenance release must be avoided at all costs. 
It over time would 
@@ -163,11 +164,13 @@ erode confidence in our release process.
 - Distributions or advanced users can always backport risky patches on their 
own, if they see fit.
 
 For me, the consequence of these is that we should backport in the following 
situations:
+
 - Both the bug and the fix are well understood and isolated. Code being 
modified is well tested.
 - The bug being addressed is high priority to the community.
 - The backported fix does not vary widely from the master branch fix.
 
 We tend to avoid backports in the converse situations:
+
 - The bug or fix are not well understood. For instance, it relates to 
interactions between complex 
 components or third party libraries (e.g. Hadoop libraries). The code is not 
well tested outside 
 of the immediate bug being fixed.

http://git-wip-us.apache.org/repos/asf/spark-website/blob/de3e0a79/contributing.md
--
diff --git a/contributing.md b/contributing.md
index 85fa478..c995f5b 100644
--- a/contributing.md
+++ b/contributing.md
@@ -449,6 +449,7 @@ For inline comment with the code, use `//` and not `/*  .. 
*/`.
 Always import packages using absolute paths (e.g. `scala.util.Random`) instead 
of relative ones 
 (e.g. `util.Random`). In addition, sort imports in the following order 
 (use alphabetical order within each group):
+
 - `java.*` and `javax.*`
 - `scala.*`
 - Third-party libraries (`org.*`, `com.*`, etc)

http://git-wip-us.apache.org/repos/asf/spark-website/blob/de3e0a79/security.md
--
diff --git a/security.md b/security.md
index 0a375a7..fd1fe46 100644
--- a/security.md
+++ b/security.md
@@ -42,6 +42,7 @@ Mitigation:
 Update to Apache Spark 2.1.2, 2.2.0 or later.
 
 Credit:
+
 - Aditya Sharad, Semmle
 
 CVE-2017-7678 Apache Spark XSS web UI MHTML 
vulnerability
@@ -63,6 +64,7 @@ Update to Apache Spark 2.1.2, 2.2.0 or later.
 
 Example:
 Request:
+
 ```
 GET /app/?appId=Content-Type:%20multipart/related;%20boundary=_AppScan%0d%0a--
 _AppScan%0d%0aContent-Location:foo%0d%0aContent-Transfer-
@@ -71,6 +73,7 @@ HTTP/1.1
 ```
 
 Excerpt from response:
+
 ```
 No running application with ID Content-Type: 
multipart/related;
 boundary=_AppScan
@@ -80,11 +83,14 @@ Content-Transfer-Encoding:base64
 PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
 
 ```
+
 Result: In the above payload the BASE64 data decodes as:
+
 ```
<html><script>alert("XSS")</script></html>
 ```
 
 Credit:
+
 - Mike Kasper, Nicholas Marion
 - IBM z Systems Center for Secure Engineering

http://git-wip-us.apache.org/repos/asf/spark-website/blob/de3e0a79/site/committers.html
--
diff --git a/site/committers.html b/site/committers.html
index e55622a..aa2e2c5 100644
--- a/site/committers.html
+++ b/site/committers.html
@@ -516,6 +516,7 @@ follow-up can be well communicated to all Spark developers.
 The decision point is when you have a bug fix and it’s not clear whether 
it is worth backporting.
 
 I think the following facets are important to consider:
+
 
   Backports are an extremely valuable service to the community and should 
be considered for 
 any bug fix.
@@ -525,6 +526,7 @@ erode con

spark git commit: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

2017-11-15 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 8f0e88df0 -> 7f99a05e6


[SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

## What changes were proposed in this pull request?

I added adjusted R2 as a regression metric; it is implemented in all major 
statistical analysis tools.

In practice, no one looks at R2 alone, because R2 by itself is misleading: 
adding more parameters never decreases R2 (it can only increase or stay the 
same), which encourages overfitting. Adjusted R2 addresses this by penalizing 
the score with the number of parameters.
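
For reference, the `r2adj` value added in the diff below corresponds to the 
standard formula (with n = `numInstances`, p = the number of model 
coefficients, and d = 1 when an intercept is fitted, 0 otherwise):

$$
R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - d}{n - p - d}
$$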

## How was this patch tested?

- Added a new unit test, which passes.
- ./dev/run-tests all passed.

Author: test 
Author: tengpeng 

Closes #19638 from tengpeng/master.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7f99a05e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7f99a05e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7f99a05e

Branch: refs/heads/master
Commit: 7f99a05e6ff258fc2192130451aa8aa1304bfe93
Parents: 8f0e88d
Author: test 
Authored: Wed Nov 15 10:13:01 2017 -0600
Committer: Sean Owen 
Committed: Wed Nov 15 10:13:01 2017 -0600

--
 .../spark/ml/regression/LinearRegression.scala   | 15 +++
 .../spark/ml/regression/LinearRegressionSuite.scala  |  6 ++
 2 files changed, 21 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7f99a05e/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala 
b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
index df1aa60..da6bcf0 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
@@ -722,6 +722,21 @@ class LinearRegressionSummary private[regression] (
   @Since("1.5.0")
   val r2: Double = metrics.r2
 
+  /**
+   * Returns Adjusted R^2^, the adjusted coefficient of determination.
+   * Reference: <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2">
+   * Wikipedia coefficient of determination</a>
+   *
+   * @note This ignores instance weights (setting all to 1.0) from 
`LinearRegression.weightCol`.
+   * This will change in later Spark versions.
+   */
+  @Since("2.3.0")
+  val r2adj: Double = {
+val interceptDOF = if (privateModel.getFitIntercept) 1 else 0
+1 - (1 - r2) * (numInstances - interceptDOF) /
+  (numInstances - privateModel.coefficients.size - interceptDOF)
+  }
+
   /** Residuals (label - predicted value) */
   @Since("1.5.0")
   @transient lazy val residuals: DataFrame = {

http://git-wip-us.apache.org/repos/asf/spark/blob/7f99a05e/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
index f470dca..0e0be58 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
@@ -764,6 +764,11 @@ class LinearRegressionSuite
   (Intercept) 6.3022157  0.00186003388   <2e-16 ***
   V2  4.6982442  0.00118053980   <2e-16 ***
   V3  7.1994344  0.00090447961   <2e-16 ***
+
+  # R code for r2adj
+  lm_fit <- lm(V1 ~ V2 + V3, data = d1)
+  summary(lm_fit)$adj.r.squared
+  [1] 0.9998736
   ---
 
   
@@ -771,6 +776,7 @@ class LinearRegressionSuite
   assert(model.summary.meanSquaredError ~== 0.00985449 relTol 1E-4)
   assert(model.summary.meanAbsoluteError ~== 0.07961668 relTol 1E-4)
   assert(model.summary.r2 ~== 0.9998737 relTol 1E-4)
+  assert(model.summary.r2adj ~== 0.9998736  relTol 1E-4)
 
   // Normal solver uses "WeightedLeastSquares". If no regularization is 
applied or only L2
   // regularization is applied, this algorithm uses a direct solver and 
does not generate an





spark git commit: [SPARK-22490][DOC] Add PySpark doc for SparkSession.builder

2017-11-15 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master 7f99a05e6 -> aa88b8dbb


[SPARK-22490][DOC] Add PySpark doc for SparkSession.builder

## What changes were proposed in this pull request?

In the PySpark API documentation, 
[SparkSession.builder](http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html) 
is not documented and only shows the default value description:
```
SparkSession.builder = <pyspark.sql.session.Builder object at 0x...>
```
![Generated documentation for SparkSession.builder](https://user-images.githubusercontent.com/9700541/32705514-1bdcafaa-c7ca-11e7-88bf-05566fea42de.png)

The following is the diff of the generated result.

```
$ diff old.html new.html
95a96,101
> 
> 
> builder href="#pyspark.sql.SparkSession.builder" title="Permalink to this 
> definition">¶
> A class attribute having a  href="#pyspark.sql.SparkSession.Builder" 
> title="pyspark.sql.SparkSession.Builder">Builder to construct  class="reference internal" href="#pyspark.sql.SparkSession" 
> title="pyspark.sql.SparkSession">SparkSession instances
> 
>
212,216d217
< 
< builder = 
¶
< 
<
< 
```

## How was this patch tested?

Manual.

```
cd python/docs
make html
open _build/html/pyspark.sql.html
```

Author: Dongjoon Hyun 

Closes #19726 from dongjoon-hyun/SPARK-22490.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aa88b8db
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aa88b8db
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aa88b8db

Branch: refs/heads/master
Commit: aa88b8dbbb7e71b282f31ae775140c783e83b4d6
Parents: 7f99a05
Author: Dongjoon Hyun 
Authored: Wed Nov 15 08:59:29 2017 -0800
Committer: gatorsmile 
Committed: Wed Nov 15 08:59:29 2017 -0800

--
 python/docs/pyspark.sql.rst   | 3 +++
 python/pyspark/sql/session.py | 4 
 2 files changed, 7 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/aa88b8db/python/docs/pyspark.sql.rst
--
diff --git a/python/docs/pyspark.sql.rst b/python/docs/pyspark.sql.rst
index 09848b8..5c3b7e2 100644
--- a/python/docs/pyspark.sql.rst
+++ b/python/docs/pyspark.sql.rst
@@ -7,6 +7,9 @@ Module Context
 .. automodule:: pyspark.sql
 :members:
 :undoc-members:
+:exclude-members: builder
+.. We need `exclude-members` to prevent default description generations
+   as a workaround for old Sphinx (< 1.6.6).
 
 pyspark.sql.types module
 

http://git-wip-us.apache.org/repos/asf/spark/blob/aa88b8db/python/pyspark/sql/session.py
--
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index dbbcfff..47c58bb 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -72,6 +72,9 @@ class SparkSession(object):
 ... .appName("Word Count") \\
 ... .config("spark.some.config.option", "some-value") \\
 ... .getOrCreate()
+
+.. autoattribute:: builder
+   :annotation:
 """
 
 class Builder(object):
@@ -183,6 +186,7 @@ class SparkSession(object):
 return session
 
 builder = Builder()
+"""A class attribute having a :class:`Builder` to construct 
:class:`SparkSession` instances"""
 
 _instantiatedSession = None
 





spark git commit: [SPARK-22490][DOC] Add PySpark doc for SparkSession.builder

2017-11-15 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 210f2922b -> 3cefddee5


[SPARK-22490][DOC] Add PySpark doc for SparkSession.builder

## What changes were proposed in this pull request?

In the PySpark API documentation, 
[SparkSession.builder](http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html) 
is not documented and only shows the default value description:
```
SparkSession.builder = <pyspark.sql.session.Builder object at 0x...>
```
![Generated documentation for SparkSession.builder](https://user-images.githubusercontent.com/9700541/32705514-1bdcafaa-c7ca-11e7-88bf-05566fea42de.png)

The following is the diff of the generated result.

```
$ diff old.html new.html
95a96,101
> 
> 
> builder href="#pyspark.sql.SparkSession.builder" title="Permalink to this 
> definition">¶
> A class attribute having a  href="#pyspark.sql.SparkSession.Builder" 
> title="pyspark.sql.SparkSession.Builder">Builder to construct  class="reference internal" href="#pyspark.sql.SparkSession" 
> title="pyspark.sql.SparkSession">SparkSession instances
> 
>
212,216d217
< 
< builder = 
¶
< 
<
< 
```

## How was this patch tested?

Manual.

```
cd python/docs
make html
open _build/html/pyspark.sql.html
```

Author: Dongjoon Hyun 

Closes #19726 from dongjoon-hyun/SPARK-22490.

(cherry picked from commit aa88b8dbbb7e71b282f31ae775140c783e83b4d6)
Signed-off-by: gatorsmile 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3cefddee
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3cefddee
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3cefddee

Branch: refs/heads/branch-2.2
Commit: 3cefddee5cc4e497d750b6394cb4bb6f7e524dbd
Parents: 210f2922
Author: Dongjoon Hyun 
Authored: Wed Nov 15 08:59:29 2017 -0800
Committer: gatorsmile 
Committed: Wed Nov 15 09:00:39 2017 -0800

--
 python/docs/pyspark.sql.rst   | 3 +++
 python/pyspark/sql/session.py | 4 
 2 files changed, 7 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3cefddee/python/docs/pyspark.sql.rst
--
diff --git a/python/docs/pyspark.sql.rst b/python/docs/pyspark.sql.rst
index 09848b8..5c3b7e2 100644
--- a/python/docs/pyspark.sql.rst
+++ b/python/docs/pyspark.sql.rst
@@ -7,6 +7,9 @@ Module Context
 .. automodule:: pyspark.sql
 :members:
 :undoc-members:
+:exclude-members: builder
+.. We need `exclude-members` to prevent default description generations
+   as a workaround for old Sphinx (< 1.6.6).
 
 pyspark.sql.types module
 

http://git-wip-us.apache.org/repos/asf/spark/blob/3cefddee/python/pyspark/sql/session.py
--
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 4d14ae0..d661b5d 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -72,6 +72,9 @@ class SparkSession(object):
 ... .appName("Word Count") \\
 ... .config("spark.some.config.option", "some-value") \\
 ... .getOrCreate()
+
+.. autoattribute:: builder
+   :annotation:
 """
 
 class Builder(object):
@@ -183,6 +186,7 @@ class SparkSession(object):
 return session
 
 builder = Builder()
+"""A class attribute having a :class:`Builder` to construct 
:class:`SparkSession` instances"""
 
 _instantiatedSession = None
 





spark git commit: [SPARK-22469][SQL] Accuracy problem in comparison with string and numeric

2017-11-15 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master aa88b8dbb -> bc0848b4c


[SPARK-22469][SQL] Accuracy problem in comparison with string and numeric

## What changes were proposed in this pull request?
This fixes a problem caused by #15880: `select '1.5' > 0.5;` returns NULL in 
Spark but true in Hive. When comparing a string with a numeric, cast both 
sides to double, as Hive does.
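
A minimal PySpark sketch of the behaviour this patch targets (assumes a 
locally available `SparkSession`; not part of the patch itself):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SPARK-22469-demo").getOrCreate()

# With this fix, the string and the decimal literal are both coerced to
# double before comparing, so the result is true instead of NULL.
spark.sql("select '1.5' > 0.5").show()

# Roughly equivalent to what the analyzer now produces:
spark.sql("select cast('1.5' as double) > cast(0.5 as double)").show()

spark.stop()
```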

Author: liutang123 

Closes #19692 from liutang123/SPARK-22469.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bc0848b4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bc0848b4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bc0848b4

Branch: refs/heads/master
Commit: bc0848b4c1ab84ccef047363a70fd11df240dbbf
Parents: aa88b8d
Author: liutang123 
Authored: Wed Nov 15 09:02:54 2017 -0800
Committer: gatorsmile 
Committed: Wed Nov 15 09:02:54 2017 -0800

--
 .../sql/catalyst/analysis/TypeCoercion.scala|   7 +
 .../catalyst/analysis/TypeCoercionSuite.scala   |   3 +
 .../sql-tests/inputs/predicate-functions.sql|   5 +
 .../results/predicate-functions.sql.out | 140 ---
 4 files changed, 105 insertions(+), 50 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bc0848b4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
index 532d22d..074eda5 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
@@ -137,6 +137,13 @@ object TypeCoercion {
 case (DateType, TimestampType) => Some(StringType)
 case (StringType, NullType) => Some(StringType)
 case (NullType, StringType) => Some(StringType)
+
+// There is no proper decimal type we can pick,
+// using double type is the best we can do.
+// See SPARK-22469 for details.
+case (n: DecimalType, s: StringType) => Some(DoubleType)
+case (s: StringType, n: DecimalType) => Some(DoubleType)
+
 case (l: StringType, r: AtomicType) if r != StringType => Some(r)
 case (l: AtomicType, r: StringType) if (l != StringType) => Some(l)
 case (l, r) => None

http://git-wip-us.apache.org/repos/asf/spark/blob/bc0848b4/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
index 793e04f..5dcd653 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
@@ -1152,6 +1152,9 @@ class TypeCoercionSuite extends AnalysisTest {
 ruleTest(PromoteStrings,
   EqualTo(Literal(Array(1, 2)), Literal("123")),
   EqualTo(Literal(Array(1, 2)), Literal("123")))
+ruleTest(PromoteStrings,
+  GreaterThan(Literal("1.5"), Literal(BigDecimal("0.5"))),
+  GreaterThan(Cast(Literal("1.5"), DoubleType), 
Cast(Literal(BigDecimal("0.5")), DoubleType)))
   }
 
   test("cast WindowFrame boundaries to the type they operate upon") {

http://git-wip-us.apache.org/repos/asf/spark/blob/bc0848b4/sql/core/src/test/resources/sql-tests/inputs/predicate-functions.sql
--
diff --git 
a/sql/core/src/test/resources/sql-tests/inputs/predicate-functions.sql 
b/sql/core/src/test/resources/sql-tests/inputs/predicate-functions.sql
index 3b3d4ad..e99d5ce 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/predicate-functions.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/predicate-functions.sql
@@ -2,12 +2,14 @@
 select 1 = 1;
 select 1 = '1';
 select 1.0 = '1';
+select 1.5 = '1.51';
 
 -- GreaterThan
 select 1 > '1';
 select 2 > '1.0';
 select 2 > '2.0';
 select 2 > '2.2';
+select '1.5' > 0.5;
 select to_date('2009-07-30 04:17:52') > to_date('2009-07-30 04:17:52');
 select to_date('2009-07-30 04:17:52') > '2009-07-30 04:17:52';
  
@@ -16,6 +18,7 @@ select 1 >= '1';
 select 2 >= '1.0';
 select 2 >= '2.0';
 select 2.0 >= '2.2';
+select '1.5' >= 0.5;
 select to_date('2009-07-30 04:17:52') >= to_date('2009-07-30 04:17:52');
 select to_date('2009-07-30 04:17:52') >= '2009-07-30 04:17:52';
  
@@ -24,6 +27,7 @@ select 1 < '1';
 select 2 < '1.0';
 select 2 < '2.0';
 select 2.0 < '2.2';
+selec

spark git commit: [SPARK-20649][CORE] Simplify REST API resource structure.

2017-11-15 Thread irashid
Repository: spark
Updated Branches:
  refs/heads/master bc0848b4c -> 39b3f10dd


[SPARK-20649][CORE] Simplify REST API resource structure.

With the new UI store, the API resource classes have a lot less code,
since there's no need for complicated translations between the UI
types and the API types. So the code ended up with a bunch of files
with a single method declared in them.

This change re-structures the API code so that it uses fewer classes; mainly, 
most sub-resources were removed, and the code that deals with single-attempt 
and multi-attempt apps was simplified.

The only API addition is a method that returns a single attempt's 
information; it was missing in the old API, so retrieving 
"/v1/applications/appId/attemptId" resulted in a 404 even if the attempt 
existed (while URIs under that one returned valid data).

The streaming API resources received the same treatment, even though their 
data is not stored in the new UI store.
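
As a hedged illustration (not part of the patch), the single-attempt endpoint 
can be exercised with plain HTTP; the host, port, and IDs below are 
placeholder assumptions:

```
import requests

base = "http://localhost:4040/api/v1"    # assumes a locally running Spark UI
app_id = "app-20171115154153-0000"       # hypothetical application ID
attempt_id = "1"                         # hypothetical attempt ID

# Before this change the URI below returned 404 even for existing attempts;
# now it returns the attempt's information as JSON.
resp = requests.get("{}/applications/{}/{}".format(base, app_id, attempt_id))
print(resp.status_code)
print(resp.json() if resp.ok else resp.text)
```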

Author: Marcelo Vanzin 

Closes #19748 from vanzin/SPARK-20649.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/39b3f10d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/39b3f10d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/39b3f10d

Branch: refs/heads/master
Commit: 39b3f10dda73f4a1f735f17467e5c6c45c44e977
Parents: bc0848b
Author: Marcelo Vanzin 
Authored: Wed Nov 15 15:41:53 2017 -0600
Committer: Imran Rashid 
Committed: Wed Nov 15 15:41:53 2017 -0600

--
 .../status/api/v1/AllExecutorListResource.scala |  30 ---
 .../spark/status/api/v1/AllJobsResource.scala   |  35 
 .../spark/status/api/v1/AllRDDResource.scala|  31 ---
 .../spark/status/api/v1/AllStagesResource.scala |  33 ---
 .../spark/status/api/v1/ApiRootResource.scala   | 203 ++-
 .../api/v1/ApplicationEnvironmentResource.scala |  32 ---
 .../status/api/v1/ApplicationListResource.scala |   2 +-
 .../api/v1/EventLogDownloadResource.scala   |  71 ---
 .../status/api/v1/ExecutorListResource.scala|  30 ---
 .../status/api/v1/OneApplicationResource.scala  | 146 -
 .../spark/status/api/v1/OneJobResource.scala|  38 
 .../spark/status/api/v1/OneRDDResource.scala|  38 
 .../spark/status/api/v1/OneStageResource.scala  |  89 
 .../spark/status/api/v1/StagesResource.scala|  97 +
 .../spark/status/api/v1/VersionResource.scala   |  30 ---
 .../api/v1/streaming/AllBatchesResource.scala   |  78 ---
 .../streaming/AllOutputOperationsResource.scala |  66 --
 .../api/v1/streaming/AllReceiversResource.scala |  76 ---
 .../api/v1/streaming/ApiStreamingApp.scala  |  31 ++-
 .../v1/streaming/ApiStreamingRootResource.scala | 172 +---
 .../api/v1/streaming/OneBatchResource.scala |  35 
 .../streaming/OneOutputOperationResource.scala  |  39 
 .../api/v1/streaming/OneReceiverResource.scala  |  35 
 .../streaming/StreamingStatisticsResource.scala |  64 --
 24 files changed, 425 insertions(+), 1076 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/39b3f10d/core/src/main/scala/org/apache/spark/status/api/v1/AllExecutorListResource.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/status/api/v1/AllExecutorListResource.scala
 
b/core/src/main/scala/org/apache/spark/status/api/v1/AllExecutorListResource.scala
deleted file mode 100644
index 5522f4c..000
--- 
a/core/src/main/scala/org/apache/spark/status/api/v1/AllExecutorListResource.scala
+++ /dev/null
@@ -1,30 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.spark.status.api.v1
-
-import javax.ws.rs.{GET, Produces}
-import javax.ws.rs.core.MediaType
-
-import org.apache.spark.ui.SparkUI
-
-@Produces(Array(MediaType.APPLICATION_JSON))
-private[v1] class AllExecutorListResource(ui: SparkUI) {
-
-  @GET
-  def executorList(): Seq[ExecutorSummary] = ui.store.executorList(false)
-
-}

http://git-wip-us.apache.org/repos/asf/spark/blob/39b3f10d/core/

spark git commit: [SPARK-22479][SQL] Exclude credentials from SaveintoDataSourceCommand.simpleString

2017-11-15 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master 39b3f10dd -> 2014e7a78


[SPARK-22479][SQL] Exclude credentials from 
SaveintoDataSourceCommand.simpleString

## What changes were proposed in this pull request?

Do not include JDBC properties that may contain credentials when logging a 
logical plan that contains a `SaveIntoDataSourceCommand`.
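
A minimal sketch of the redaction idea (plain Python, not Spark's actual 
implementation), using the default pattern that this patch extends with 
`url`, `user`, and `username`:

```
import re

REDACTION_REGEX = re.compile(r"(?i)secret|password|url|user|username")

def redact(options):
    """Replace the value of any option whose key matches the redaction pattern."""
    return {k: "*(redacted)" if REDACTION_REGEX.search(k) else v
            for k, v in options.items()}

opts = {"dbtable": "test20",
        "driver": "org.postgresql.Driver",
        "url": "jdbc:postgresql://host/db",
        "password": "secret"}
print(redact(opts))
# {'dbtable': 'test20', 'driver': 'org.postgresql.Driver',
#  'url': '*(redacted)', 'password': '*(redacted)'}
```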

## How was this patch tested?

Built locally and reproduced the scenario (per the steps in 
https://issues.apache.org/jira/browse/SPARK-22479); the logged plans now show 
redacted values:
```
== Parsed Logical Plan ==
SaveIntoDataSourceCommand 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, 
Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> 
*(redacted), password -> *(redacted)), ErrorIfExists
   +- Range (0, 100, step=1, splits=Some(8))

== Analyzed Logical Plan ==
SaveIntoDataSourceCommand 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, 
Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> 
*(redacted), password -> *(redacted)), ErrorIfExists
   +- Range (0, 100, step=1, splits=Some(8))

== Optimized Logical Plan ==
SaveIntoDataSourceCommand 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, 
Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> 
*(redacted), password -> *(redacted)), ErrorIfExists
   +- Range (0, 100, step=1, splits=Some(8))

== Physical Plan ==
Execute SaveIntoDataSourceCommand
   +- SaveIntoDataSourceCommand 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider570127fa, 
Map(dbtable -> test20, driver -> org.postgresql.Driver, url -> 
*(redacted), password -> *(redacted)), ErrorIfExists
 +- Range (0, 100, step=1, splits=Some(8))
```

Author: osatici 

Closes #19708 from onursatici/os/redact-jdbc-creds.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2014e7a7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2014e7a7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2014e7a7

Branch: refs/heads/master
Commit: 2014e7a789d36e376ca62b1e24636d79c1b19745
Parents: 39b3f10
Author: osatici 
Authored: Wed Nov 15 14:08:51 2017 -0800
Committer: gatorsmile 
Committed: Wed Nov 15 14:08:51 2017 -0800

--
 .../apache/spark/internal/config/package.scala  |  2 +-
 .../datasources/SaveIntoDataSourceCommand.scala |  7 +++
 .../SaveIntoDataSourceCommandSuite.scala| 48 
 3 files changed, 56 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2014e7a7/core/src/main/scala/org/apache/spark/internal/config/package.scala
--
diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala 
b/core/src/main/scala/org/apache/spark/internal/config/package.scala
index 57e2da8..84315f5 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/package.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala
@@ -307,7 +307,7 @@ package object config {
 "a property key or value, the value is redacted from the environment 
UI and various logs " +
 "like YARN and event logs.")
   .regexConf
-  .createWithDefault("(?i)secret|password".r)
+  .createWithDefault("(?i)secret|password|url|user|username".r)
 
   private[spark] val STRING_REDACTION_PATTERN =
 ConfigBuilder("spark.redaction.string.regex")

http://git-wip-us.apache.org/repos/asf/spark/blob/2014e7a7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala
index 96c84ea..568e953 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala
@@ -17,11 +17,13 @@
 
 package org.apache.spark.sql.execution.datasources
 
+import org.apache.spark.SparkEnv
 import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}
 import org.apache.spark.sql.catalyst.plans.QueryPlan
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.execution.command.RunnableCommand
 import org.apache.spark.sql.sources.CreatableRelationProvider
+import org.apache.spark.util.Utils
 
 /**
  * Saves the results of `query` in to a data source.
@@ -46,4 +48,9 @@ case class SaveIntoDataSourceCommand(
 
 Seq.empty[Row]
   }
+
+  override def simpleString: String = {

spark git commit: [SPARK-21842][MESOS] Support Kerberos ticket renewal and creation in Mesos

2017-11-15 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/master 2014e7a78 -> 1e8233541


[SPARK-21842][MESOS] Support Kerberos ticket renewal and creation in Mesos

## What changes were proposed in this pull request?
tl;dr: Add a class, `MesosHadoopDelegationTokenManager`, that renews 
delegation tokens on a schedule on behalf of Spark drivers and broadcasts the 
renewed credentials to the executors.

## The problem
We recently added Kerberos support to Mesos-based Spark jobs, as well as 
secrets support to the Mesos Dispatcher (SPARK-16742 and SPARK-20812, 
respectively). However, delegation tokens have a defined expiration, which 
poses a problem for long-running Spark jobs (e.g. Spark Streaming 
applications). YARN solves this by scheduling a thread to renew the tokens 
when they reach 75% of their way to expiration; it then writes the tokens to 
HDFS for the executors to find (using a monotonically increasing suffix).

## This solution
We replace the current method in `CoarseGrainedSchedulerBackend`, which used 
to discard the token renewal time, with a protected method 
`fetchHadoopDelegationTokens`. The individual cluster backends are now 
responsible for overriding this method to fetch tokens and manage their 
renewal. The delegation tokens themselves are still part of 
`CoarseGrainedSchedulerBackend`, as before.
In the case of Mesos, renewed credentials are broadcast to the executors. 
This keeps all credential transfer within Spark (as opposed to 
Spark-to-HDFS), requires no writing of credentials to disk, and requires no 
GC of old files.
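
A minimal sketch of the renewal scheduling described above (hypothetical 
helper, not Spark's API): renew once a given fraction of the token lifetime, 
75% in the YARN-style approach, has elapsed.

```
import time

def next_renewal_time(issue_time_ms, expiration_time_ms, fraction=0.75):
    """Epoch millis at which the delegation tokens should be renewed."""
    lifetime = expiration_time_ms - issue_time_ms
    return issue_time_ms + int(lifetime * fraction)

now_ms = int(time.time() * 1000)
expiry_ms = now_ms + 24 * 60 * 60 * 1000        # e.g. a 24-hour token
print(next_renewal_time(now_ms, expiry_ms))     # roughly 18 hours from now
```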

## How was this patch tested?
Manually against a Kerberized HDFS cluster.

Thank you for the reviews.

Author: ArtRand 

Closes #19272 from ArtRand/spark-21842-450-kerberos-ticket-renewal.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e823354
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e823354
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e823354

Branch: refs/heads/master
Commit: 1e82335413bc2384073ead0d6d581c862036d0f5
Parents: 2014e7a
Author: ArtRand 
Authored: Wed Nov 15 15:53:05 2017 -0800
Committer: Marcelo Vanzin 
Committed: Wed Nov 15 15:53:05 2017 -0800

--
 .../apache/spark/deploy/SparkHadoopUtil.scala   |  28 +++-
 .../security/HadoopDelegationTokenManager.scala |   3 +
 .../executor/CoarseGrainedExecutorBackend.scala |   9 +-
 .../cluster/CoarseGrainedClusterMessage.scala   |   3 +
 .../cluster/CoarseGrainedSchedulerBackend.scala |  30 +---
 .../cluster/mesos/MesosClusterManager.scala |   2 +-
 .../MesosCoarseGrainedSchedulerBackend.scala|  19 ++-
 .../MesosHadoopDelegationTokenManager.scala | 157 +++
 .../yarn/security/AMCredentialRenewer.scala |  21 +--
 9 files changed, 228 insertions(+), 44 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1e823354/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
index 1fa10ab..17c7319 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
@@ -140,13 +140,24 @@ class SparkHadoopUtil extends Logging {
 if (!new File(keytabFilename).exists()) {
   throw new SparkException(s"Keytab file: ${keytabFilename} does not 
exist")
 } else {
-  logInfo("Attempting to login to Kerberos" +
-s" using principal: ${principalName} and keytab: ${keytabFilename}")
+  logInfo("Attempting to login to Kerberos " +
+s"using principal: ${principalName} and keytab: ${keytabFilename}")
   UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
 }
   }
 
   /**
+   * Add or overwrite current user's credentials with serialized delegation 
tokens,
+   * also confirms correct hadoop configuration is set.
+   */
+  private[spark] def addDelegationTokens(tokens: Array[Byte], sparkConf: 
SparkConf) {
+UserGroupInformation.setConfiguration(newConfiguration(sparkConf))
+val creds = deserialize(tokens)
+logInfo(s"Adding/updating delegation tokens ${dumpTokens(creds)}")
+addCurrentUserCredentials(creds)
+  }
+
+  /**
* Returns a function that can be called to find Hadoop FileSystem bytes 
read. If
* getFSBytesReadOnThreadCallback is called from thread r at time t, the 
returned callback will
* return the bytes read on r since t.
@@ -463,6 +474,19 @@ object SparkHadoopUtil {
   }
 
   /**
+   * Given an expiration date (e.g. for Hadoop Delegation Tokens) return a the 
date
+   * when a given fraction of the duration until the e

spark git commit: [SPARK-22535][PYSPARK] Sleep before killing the python worker in PythonRunner.MonitorThread

2017-11-15 Thread ueshin
Repository: spark
Updated Branches:
  refs/heads/master 1e8233541 -> 03f2b7bff


[SPARK-22535][PYSPARK] Sleep before killing the python worker in 
PythonRunner.MonitorThread

## What changes were proposed in this pull request?

`PythonRunner.MonitorThread` should give the task a little time to finish 
before forcibly killing the Python worker. This greatly reduces the chance of 
hitting the race condition. I also improved the log message so it identifies 
the task to blame when it gets stuck.
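
A minimal sketch of the monitoring pattern described above (plain Python, not 
PySpark's internals); `task_completed` and `kill_worker` are hypothetical 
callables:

```
import time

# Mirrors the default of the new spark.python.task.killTimeout setting ("2s").
TASK_KILL_TIMEOUT_S = 2.0

def monitor_interrupted_task(task_completed, kill_worker):
    """Give an interrupted task a grace period before forcibly killing its worker."""
    if not task_completed():
        time.sleep(TASK_KILL_TIMEOUT_S)   # let the task wind down first
        if not task_completed():          # still stuck: kill the worker
            try:
                kill_worker()
            except Exception as e:
                print("Exception when trying to kill worker: {}".format(e))

# Example wiring with dummy callables:
monitor_interrupted_task(lambda: False, lambda: print("killing python worker"))
```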

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #19762 from zsxwing/SPARK-22535.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/03f2b7bf
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/03f2b7bf
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/03f2b7bf

Branch: refs/heads/master
Commit: 03f2b7bff7e537ec747b41ad22e448e1c141f0dd
Parents: 1e82335
Author: Shixiong Zhu 
Authored: Thu Nov 16 14:22:25 2017 +0900
Committer: Takuya UESHIN 
Committed: Thu Nov 16 14:22:25 2017 +0900

--
 .../apache/spark/api/python/PythonRunner.scala  | 21 ++--
 1 file changed, 15 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/03f2b7bf/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
--
diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
index d417303..9989f68 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
@@ -337,6 +337,9 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   class MonitorThread(env: SparkEnv, worker: Socket, context: TaskContext)
 extends Thread(s"Worker Monitor for $pythonExec") {
 
+/** How long to wait before killing the python worker if a task cannot be 
interrupted. */
+private val taskKillTimeout = 
env.conf.getTimeAsMs("spark.python.task.killTimeout", "2s")
+
 setDaemon(true)
 
 override def run() {
@@ -346,12 +349,18 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
 Thread.sleep(2000)
   }
   if (!context.isCompleted) {
-try {
-  logWarning("Incomplete task interrupted: Attempting to kill Python 
Worker")
-  env.destroyPythonWorker(pythonExec, envVars.asScala.toMap, worker)
-} catch {
-  case e: Exception =>
-logError("Exception when trying to kill worker", e)
+Thread.sleep(taskKillTimeout)
+if (!context.isCompleted) {
+  try {
+// Mimic the task name used in `Executor` to help the user find 
out the task to blame.
+val taskName = s"${context.partitionId}.${context.taskAttemptId} " 
+
+  s"in stage ${context.stageId} (TID ${context.taskAttemptId})"
+logWarning(s"Incomplete task $taskName interrupted: Attempting to 
kill Python Worker")
+env.destroyPythonWorker(pythonExec, envVars.asScala.toMap, worker)
+  } catch {
+case e: Exception =>
+  logError("Exception when trying to kill worker", e)
+  }
 }
   }
 }

