svn commit: r25786 - in /dev/spark/2.4.0-SNAPSHOT-2018_03_16_16_01-8a1efe3-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-03-16 Thread pwendell
Author: pwendell
Date: Fri Mar 16 23:15:16 2018
New Revision: 25786

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_03_16_16_01-8a1efe3 docs


[This commit notification would consist of 1449 parts, 
which exceeds the limit of 50 parts, so it was shortened to this summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-23683][SQL] FileCommitProtocol.instantiate() hardening

2018-03-16 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 8a72734f3 -> 8a1efe307


[SPARK-23683][SQL] FileCommitProtocol.instantiate() hardening

## What changes were proposed in this pull request?

With SPARK-20236, `FileCommitProtocol.instantiate()` looks for a three argument 
constructor, passing in the `dynamicPartitionOverwrite` parameter. If there is 
no such constructor, it falls back to the classic two-arg one.

When `InsertIntoHadoopFsRelationCommand` passes down that 
`dynamicPartitionOverwrite` flag to `FileCommitProtocol.instantiate()`, it 
assumes that the instantiated protocol supports the specific requirements of 
dynamic partition overwrite. It does not notice when this does not hold, and so 
the output generated may be incorrect.

This patch changes `FileCommitProtocol.instantiate()` so that when 
`dynamicPartitionOverwrite == true`, it requires the protocol implementation to 
have a 3-arg constructor. Classic two-arg constructors are supported when it is 
false.

It also adds some debug-level logging for anyone trying to understand what's 
going on.
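
A practical consequence for committer implementations: to be usable with dynamic 
partition overwrite, they must expose the three-argument constructor. A minimal 
sketch (the subclass name here is hypothetical; `HadoopMapReduceCommitProtocol` 
already provides such a constructor):

```scala
import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

// Hypothetical committer: declaring the (jobId, path, dynamicPartitionOverwrite)
// constructor is what lets FileCommitProtocol.instantiate() select it when
// dynamic partition overwrite is requested.
class AuditingCommitProtocol(
    jobId: String,
    path: String,
    dynamicPartitionOverwrite: Boolean)
  extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite)
```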

## How was this patch tested?

Unit tests verify that

* classes with only a 2-arg constructor cannot be used with dynamic overwrite
* classes with only a 2-arg constructor can be used without dynamic overwrite
* classes with a 3-arg constructor can be used with both
* the fallback to the two-arg constructor takes place only after the attempt to 
load the 3-arg constructor
* passing in invalid class types fails as expected (regression tests on expected 
behavior)

Author: Steve Loughran 

Closes #20824 from steveloughran/stevel/SPARK-23683-protocol-instantiate.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8a1efe30
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8a1efe30
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8a1efe30

Branch: refs/heads/master
Commit: 8a1efe3076f29259151f1fba2ff894487efb6c4e
Parents: 8a72734
Author: Steve Loughran 
Authored: Fri Mar 16 15:40:21 2018 -0700
Committer: Wenchen Fan 
Committed: Fri Mar 16 15:40:21 2018 -0700

--
 .../spark/internal/io/FileCommitProtocol.scala  |  11 +-
 .../FileCommitProtocolInstantiationSuite.scala  | 148 +++
 2 files changed, 158 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8a1efe30/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala
--
diff --git a/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala b/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala
index 6d0059b..e6e9c9e 100644
--- a/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala
+++ b/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala
@@ -20,6 +20,7 @@ package org.apache.spark.internal.io
 import org.apache.hadoop.fs._
 import org.apache.hadoop.mapreduce._
 
+import org.apache.spark.internal.Logging
 import org.apache.spark.util.Utils
 
 
@@ -132,7 +133,7 @@ abstract class FileCommitProtocol {
 }
 
 
-object FileCommitProtocol {
+object FileCommitProtocol extends Logging {
   class TaskCommitMessage(val obj: Any) extends Serializable
 
   object EmptyTaskCommitMessage extends TaskCommitMessage(null)
@@ -145,15 +146,23 @@ object FileCommitProtocol {
       jobId: String,
       outputPath: String,
       dynamicPartitionOverwrite: Boolean = false): FileCommitProtocol = {
+
+    logDebug(s"Creating committer $className; job $jobId; output=$outputPath;" +
+      s" dynamic=$dynamicPartitionOverwrite")
     val clazz = Utils.classForName(className).asInstanceOf[Class[FileCommitProtocol]]
     // First try the constructor with arguments (jobId: String, outputPath: String,
     // dynamicPartitionOverwrite: Boolean).
     // If that doesn't exist, try the one with (jobId: string, outputPath: String).
     try {
       val ctor = clazz.getDeclaredConstructor(classOf[String], classOf[String], classOf[Boolean])
+      logDebug("Using (String, String, Boolean) constructor")
       ctor.newInstance(jobId, outputPath, dynamicPartitionOverwrite.asInstanceOf[java.lang.Boolean])
     } catch {
       case _: NoSuchMethodException =>
+        logDebug("Falling back to (String, String) constructor")
+        require(!dynamicPartitionOverwrite,
+          "Dynamic Partition Overwrite is enabled but" +
+            s" the committer ${className} does not have the appropriate constructor")
         val ctor = clazz.getDeclaredConstructor(classOf[String], classOf[String])
         ctor.newInstance(jobId, outputPath)
     }


svn commit: r25780 - in /dev/spark/2.4.0-SNAPSHOT-2018_03_16_12_01-8a72734-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-03-16 Thread pwendell
Author: pwendell
Date: Fri Mar 16 19:15:25 2018
New Revision: 25780

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_03_16_12_01-8a72734 docs


[This commit notification would consist of 1448 parts, 
which exceeds the limit of 50 parts, so it was shortened to this summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-15009][PYTHON][ML] Construct a CountVectorizerModel from a vocabulary list

2018-03-16 Thread holden
Repository: spark
Updated Branches:
  refs/heads/master bd201bf61 -> 8a72734f3


[SPARK-15009][PYTHON][ML] Construct a CountVectorizerModel from a vocabulary 
list

## What changes were proposed in this pull request?

Added a class method to construct CountVectorizerModel from a list of 
vocabulary strings, equivalent to the Scala version.  Introduced a common param 
base class `_CountVectorizerParams` to allow the Python model to also own the 
parameters.  This now matches the Scala class hierarchy.
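
For reference, the Python classmethod mirrors the constructor already available 
on the Scala side; a minimal sketch of that existing usage (column names are 
illustrative):

```scala
import org.apache.spark.ml.feature.CountVectorizerModel

// Build a model directly from a known vocabulary instead of fitting one.
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")
```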

## How was this patch tested?

Added CountVectorizer doctests that do a transform on a model constructed from a 
vocabulary, and a unit test to verify params and vocab are constructed correctly.

Author: Bryan Cutler 

Closes #16770 from 
BryanCutler/pyspark-CountVectorizerModel-vocab_ctor-SPARK-15009.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8a72734f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8a72734f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8a72734f

Branch: refs/heads/master
Commit: 8a72734f33f6a0abbd3207b0d661633c8b25d9ad
Parents: bd201bf
Author: Bryan Cutler 
Authored: Fri Mar 16 11:42:57 2018 -0700
Committer: Holden Karau 
Committed: Fri Mar 16 11:42:57 2018 -0700

--
 python/pyspark/ml/feature.py | 168 +-
 python/pyspark/ml/tests.py   |  32 +++-
 2 files changed, 142 insertions(+), 58 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8a72734f/python/pyspark/ml/feature.py
--
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index f2e357f..a1ceb7f 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -19,12 +19,12 @@ import sys
 if sys.version > '3':
 basestring = str
 
-from pyspark import since, keyword_only
+from pyspark import since, keyword_only, SparkContext
 from pyspark.rdd import ignore_unicode_prefix
 from pyspark.ml.linalg import _convert_to_vector
 from pyspark.ml.param.shared import *
 from pyspark.ml.util import JavaMLReadable, JavaMLWritable
-from pyspark.ml.wrapper import JavaEstimator, JavaModel, JavaTransformer, _jvm
+from pyspark.ml.wrapper import JavaEstimator, JavaModel, JavaParams, JavaTransformer, _jvm
 from pyspark.ml.common import inherit_doc
 
 __all__ = ['Binarizer',
@@ -403,8 +403,69 @@ class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasHandleInvalid,
 return self.getOrDefault(self.splits)
 
 
+class _CountVectorizerParams(JavaParams, HasInputCol, HasOutputCol):
+"""
+Params for :py:attr:`CountVectorizer` and :py:attr:`CountVectorizerModel`.
+"""
+
+minTF = Param(
+Params._dummy(), "minTF", "Filter to ignore rare words in" +
+" a document. For each document, terms with frequency/count less than 
the given" +
+" threshold are ignored. If this is an integer >= 1, then this 
specifies a count (of" +
+" times the term must appear in the document); if this is a double in 
[0,1), then this " +
+"specifies a fraction (out of the document's token count). Note that 
the parameter is " +
+"only used in transform of CountVectorizerModel and does not affect 
fitting. Default 1.0",
+typeConverter=TypeConverters.toFloat)
+minDF = Param(
+Params._dummy(), "minDF", "Specifies the minimum number of" +
+" different documents a term must appear in to be included in the 
vocabulary." +
+" If this is an integer >= 1, this specifies the number of documents 
the term must" +
+" appear in; if this is a double in [0,1), then this specifies the 
fraction of documents." +
+" Default 1.0", typeConverter=TypeConverters.toFloat)
+vocabSize = Param(
+Params._dummy(), "vocabSize", "max size of the vocabulary. Default 1 
<< 18.",
+typeConverter=TypeConverters.toInt)
+binary = Param(
+Params._dummy(), "binary", "Binary toggle to control the output vector 
values." +
+" If True, all nonzero counts (after minTF filter applied) are set to 
1. This is useful" +
+" for discrete probabilistic models that model binary events rather 
than integer counts." +
+" Default False", typeConverter=TypeConverters.toBoolean)
+
+def __init__(self, *args):
+super(_CountVectorizerParams, self).__init__(*args)
+self._setDefault(minTF=1.0, minDF=1.0, vocabSize=1 << 18, binary=False)
+
+@since("1.6.0")
+def getMinTF(self):
+"""
+Gets the value of minTF or its default value.
+"""
+return self.getOrDefault(self.minTF)
+
+@since("1.6.0")
+def getMinDF(self):
+"""
+Gets the value of minDF or its 

spark git commit: [SPARK-23623][SS] Avoid concurrent use of cached consumers in CachedKafkaConsumer

2018-03-16 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 9945b0227 -> bd201bf61


[SPARK-23623][SS] Avoid concurrent use of cached consumers in 
CachedKafkaConsumer

## What changes were proposed in this pull request?

CachedKafkaConsumer in the project `kafka-0-10-sql` is designed to maintain a 
pool of KafkaConsumers that can be reused. However, it was built with the 
assumption that there will be only one task trying to read the same Kafka 
TopicPartition at the same time. Hence, the cache was keyed by the 
TopicPartition a consumer is supposed to read. And for any cases where this 
assumption may not be true, there is a SparkPlan flag to disable the use of the 
cache. So it was up to the planner to correctly identify when it was not safe 
to use the cache and set the flag accordingly.

Fundamentally, this is the wrong way to approach the problem. It is HARD for a 
high-level planner to reason about the low-level execution model, whether there 
will be multiple tasks in the same query trying to read the same partition. 
Case in point, 2.3.0 introduced stream-stream joins, and you can build a 
streaming self-join query on Kafka. It's pretty non-trivial to figure out how 
this leads to two tasks reading the same partition twice, possibly 
concurrently. And due to the non-triviality, it is hard to figure this out in 
the planner and set the flag to avoid the cache / consumer pool. And this can 
inadvertently lead to ConcurrentModificationException, or worse, silent reading 
of incorrect data.

Here is a better way to design this. The planner shouldn't have to understand 
these low-level optimizations. Rather, the consumer pool should be smart enough 
to avoid concurrent use of a cached consumer. Currently, it tries to do so but 
incorrectly (the `inuse` flag is not checked when returning a cached consumer, 
see 
[this](https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala#L403)).
 If there is another request for the same partition as a currently in-use 
consumer, the pool should automatically return a fresh consumer that should be 
closed when the task is done. Then the planner does not have to have a flag to 
avoid reuse.

This PR is a step towards that goal. It does the following.
- There are effectively two kinds of consumer that may be generated
  - Cached consumer - this should be returned to the pool at task end
  - Non-cached consumer - this should be closed at task end
- A trait called `KafkaDataConsumer` is introduced to hide this difference from 
the users of the consumer, so that the client code does not have to reason about 
whether to stop and release. They simply call `val consumer = 
KafkaDataConsumer.acquire` and then `consumer.release()` (see the usage sketch 
after this description).
- If there is a request for a consumer that is in use, then a new consumer is 
generated.
- If there is a concurrent attempt of the same task, then a new consumer is 
generated, and the existing cached consumer is marked for close upon release.
- In addition, I renamed the classes because CachedKafkaConsumer is a misnomer 
given that what it returns may or may not be cached.

This PR does not remove the planner flag to avoid reuse to make this patch safe 
enough for merging in branch-2.3. This can be done later in master-only.
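
A minimal usage sketch of the acquire/release pattern described above (this is 
an internal API; the parameter lists shown here are assumptions and may not 
match the final signatures exactly):

```scala
import java.{util => ju}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.kafka010.KafkaDataConsumer

val topicPartition = new TopicPartition("events", 0)
val kafkaParams: ju.Map[String, Object] = new ju.HashMap[String, Object]()

// The caller no longer needs to know whether the consumer is cached or freshly
// created; release() either returns it to the pool or closes it.
val consumer = KafkaDataConsumer.acquire(topicPartition, kafkaParams, useCache = true)
try {
  // ... read the assigned offset range through `consumer` ...
} finally {
  consumer.release()
}
```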

## How was this patch tested?
A new stress test that verifies it is safe to concurrently get consumers for 
the same partition from the consumer pool.

Author: Tathagata Das 

Closes #20767 from tdas/SPARK-23623.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bd201bf6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bd201bf6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bd201bf6

Branch: refs/heads/master
Commit: bd201bf61e8e1713deb91b962f670c76c9e3492b
Parents: 9945b02
Author: Tathagata Das 
Authored: Fri Mar 16 11:11:07 2018 -0700
Committer: Shixiong Zhu 
Committed: Fri Mar 16 11:11:07 2018 -0700

--
 .../sql/kafka010/CachedKafkaConsumer.scala  | 438 
 .../sql/kafka010/KafkaContinuousReader.scala|   5 +-
 .../spark/sql/kafka010/KafkaDataConsumer.scala  | 516 +++
 .../sql/kafka010/KafkaMicroBatchReader.scala|  22 +-
 .../spark/sql/kafka010/KafkaSourceRDD.scala |  23 +-
 .../sql/kafka010/CachedKafkaConsumerSuite.scala |  34 --
 .../sql/kafka010/KafkaDataConsumerSuite.scala   | 124 +
 7 files changed, 651 insertions(+), 511 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bd201bf6/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
--
diff --git 

spark git commit: [SPARK-23680] Fix entrypoint.sh to properly support Arbitrary UIDs

2018-03-16 Thread eje
Repository: spark
Updated Branches:
  refs/heads/master 88d8de926 -> 9945b0227


[SPARK-23680] Fix entrypoint.sh to properly support Arbitrary UIDs

## What changes were proposed in this pull request?

As described in SPARK-23680, entrypoint.sh returns an error code because a 
command in its startup sequence (`getent`) is expected to fail in OpenShift 
environments, where containers run under arbitrary UIDs that have no passwd entry.

## How was this patch tested?

This patch was manually tested by using the docker-image-tool.sh script to 
generate a Spark driver image and running an example against an OpenShift cluster.


Author: Ricardo Martinelli de Oliveira 

Closes #20822 from rimolive/rmartine-spark-23680.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9945b022
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9945b022
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9945b022

Branch: refs/heads/master
Commit: 9945b0227efcd952c8e835453b2831a8c6d5d607
Parents: 88d8de9
Author: Ricardo Martinelli de Oliveira 
Authored: Fri Mar 16 10:37:11 2018 -0700
Committer: Erik Erlandson 
Committed: Fri Mar 16 10:37:11 2018 -0700

--
 .../kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh| 3 +++
 1 file changed, 3 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9945b022/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh
--
diff --git 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh
index 3d67b0a..d0cf284 100755
--- 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh
+++ 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh
@@ -22,7 +22,10 @@ set -ex
 # Check whether there is a passwd entry for the container UID
 myuid=$(id -u)
 mygid=$(id -g)
+# turn off -e for getent because it will return error code in anonymous uid case
+set +e
 uidentry=$(getent passwd $myuid)
+set -e
 
 # If there is no passwd entry for the container UID, attempt to create one
 if [ -z "$uidentry" ] ; then


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-23581][SQL] Add interpreted unsafe projection

2018-03-16 Thread hvanhovell
Repository: spark
Updated Branches:
  refs/heads/master dffeac369 -> 88d8de926


[SPARK-23581][SQL] Add interpreted unsafe projection

## What changes were proposed in this pull request?
We currently can only create unsafe rows using code generation. This is a 
problem for situations in which code generation fails. There is no fallback, 
and as a result we cannot execute the query.

This PR adds an interpreted version of `UnsafeProjection`. The implementation 
is modeled after `InterpretedMutableProjection`. It stores the expression 
results in a `GenericInternalRow`, and it then uses a conversion function to 
convert the `GenericInternalRow` into an `UnsafeRow`.

This PR does not implement the actual codegen-to-interpreted fallback logic. 
This will be done in a follow-up.
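
To make the contract concrete, here is a small illustration using the existing 
code-generated `UnsafeProjection`; the interpreted version added in this PR 
implements the same input/output behaviour without relying on codegen:

```scala
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection}
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

val schema = StructType(Seq(StructField("id", LongType), StructField("name", StringType)))
val toUnsafe = UnsafeProjection.create(schema)

// A boxed, generic internal row is projected into the compact UnsafeRow format.
val row = new GenericInternalRow(Array[Any](1L, UTF8String.fromString("spark")))
val unsafeRow = toUnsafe(row)
```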

## How was this patch tested?
I am piggybacking on existing `UnsafeProjection` tests, and I have added an 
interpreted version for each of these.

Author: Herman van Hovell 

Closes #20750 from hvanhovell/SPARK-23581.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/88d8de92
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/88d8de92
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/88d8de92

Branch: refs/heads/master
Commit: 88d8de9260edf6e9d5449ff7ef6e35d16051fc9f
Parents: dffeac3
Author: Herman van Hovell 
Authored: Fri Mar 16 18:28:16 2018 +0100
Committer: Herman van Hovell 
Committed: Fri Mar 16 18:28:16 2018 +0100

--
 .../expressions/codegen/UnsafeArrayWriter.java  |  32 +-
 .../expressions/codegen/UnsafeRowWriter.java|  30 +-
 .../expressions/codegen/UnsafeWriter.java   |  43 +++
 .../sql/catalyst/expressions/Expression.scala   |  26 ++
 .../InterpretedUnsafeProjection.scala   | 366 +++
 .../expressions/MonotonicallyIncreasingID.scala |   4 +-
 .../sql/catalyst/expressions/Projection.scala   |  19 +-
 .../codegen/GenerateUnsafeProjection.scala  |   2 +-
 .../expressions/randomExpressions.scala |   6 +-
 .../catalyst/expressions/ComplexTypeSuite.scala |   2 +-
 .../expressions/ExpressionEvalHelper.scala  |  20 +-
 .../expressions/ObjectExpressionsSuite.scala|  21 +-
 .../catalyst/expressions/ScalaUDFSuite.scala|   2 +-
 .../expressions/UnsafeRowConverterSuite.scala   |  56 +--
 14 files changed, 555 insertions(+), 74 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/88d8de92/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java
--
diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java
index 791e8d8..82cd1b2 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java
@@ -30,7 +30,7 @@ import static 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.calculat
  * A helper class to write data into global row buffer using `UnsafeArrayData` 
format,
  * used by {@link 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection}.
  */
-public class UnsafeArrayWriter {
+public final class UnsafeArrayWriter extends UnsafeWriter {
 
   private BufferHolder holder;
 
@@ -83,7 +83,7 @@ public class UnsafeArrayWriter {
 return startingOffset + headerInBytes + ordinal * elementSize;
   }
 
-  public void setOffsetAndSize(int ordinal, long currentCursor, int size) {
+  public void setOffsetAndSize(int ordinal, int currentCursor, int size) {
 assertIndexIsValid(ordinal);
 final long relativeOffset = currentCursor - startingOffset;
 final long offsetAndSize = (relativeOffset << 32) | (long)size;
@@ -96,49 +96,31 @@ public class UnsafeArrayWriter {
 BitSetMethods.set(holder.buffer, startingOffset + 8, ordinal);
   }
 
-  public void setNullBoolean(int ordinal) {
-setNullBit(ordinal);
-// put zero into the corresponding field when set null
-Platform.putBoolean(holder.buffer, getElementOffset(ordinal, 1), false);
-  }
-
-  public void setNullByte(int ordinal) {
+  public void setNull1Bytes(int ordinal) {
 setNullBit(ordinal);
 // put zero into the corresponding field when set null
 Platform.putByte(holder.buffer, getElementOffset(ordinal, 1), (byte)0);
   }
 
-  public void setNullShort(int ordinal) {
+  public void setNull2Bytes(int ordinal) {
 setNullBit(ordinal);
 // put zero into the corresponding field when set null
 Platform.putShort(holder.buffer, 

[2/2] spark-website git commit: Squashed commit of the following:

2018-03-16 Thread rxin
Squashed commit of the following:

commit 8e2dd71cf5613be6f019bb76b46226771422a40e
Merge: 8bd24fb6d 01f0b4e0c
Author: Reynold Xin 
Date:   Fri Mar 16 10:24:54 2018 -0700

Merge pull request #104 from mateiz/history

Add a project history page

commit 01f0b4e0c1fe77781850cf994058980664201bce
Author: Matei Zaharia 
Date:   Wed Mar 14 23:29:01 2018 -0700

Add a project history page


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/a1d84bcb
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/a1d84bcb
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/a1d84bcb

Branch: refs/heads/asf-site
Commit: a1d84bcbf53099be51c39914528bea3f4e2735a0
Parents: 8bd24fb
Author: Reynold Xin 
Authored: Fri Mar 16 10:26:14 2018 -0700
Committer: Reynold Xin 
Committed: Fri Mar 16 10:26:14 2018 -0700

--
 _layouts/global.html|   1 +
 community.md|  24 +-
 history.md  |  29 +++
 index.md|  16 +-
 site/committers.html|   1 +
 site/community.html |  24 +-
 site/contributing.html  |   1 +
 site/developer-tools.html   |   1 +
 site/documentation.html |   1 +
 site/downloads.html |   1 +
 site/examples.html  |   1 +
 site/faq.html   |   1 +
 site/graphx/index.html  |   1 +
 site/history.html   | 235 +++
 site/improvement-proposals.html |   1 +
 site/index.html |  17 +-
 site/mailing-lists.html |   1 +
 site/mllib/index.html   |   1 +
 site/news/amp-camp-2013-registration-ope.html   |   1 +
 .../news/announcing-the-first-spark-summit.html |   1 +
 .../news/fourth-spark-screencast-published.html |   1 +
 site/news/index.html|   1 +
 site/news/nsdi-paper.html   |   1 +
 site/news/one-month-to-spark-summit-2015.html   |   1 +
 .../proposals-open-for-spark-summit-east.html   |   1 +
 ...registration-open-for-spark-summit-east.html |   1 +
 .../news/run-spark-and-shark-on-amazon-emr.html |   1 +
 site/news/spark-0-6-1-and-0-5-2-released.html   |   1 +
 site/news/spark-0-6-2-released.html |   1 +
 site/news/spark-0-7-0-released.html |   1 +
 site/news/spark-0-7-2-released.html |   1 +
 site/news/spark-0-7-3-released.html |   1 +
 site/news/spark-0-8-0-released.html |   1 +
 site/news/spark-0-8-1-released.html |   1 +
 site/news/spark-0-9-0-released.html |   1 +
 site/news/spark-0-9-1-released.html |   1 +
 site/news/spark-0-9-2-released.html |   1 +
 site/news/spark-1-0-0-released.html |   1 +
 site/news/spark-1-0-1-released.html |   1 +
 site/news/spark-1-0-2-released.html |   1 +
 site/news/spark-1-1-0-released.html |   1 +
 site/news/spark-1-1-1-released.html |   1 +
 site/news/spark-1-2-0-released.html |   1 +
 site/news/spark-1-2-1-released.html |   1 +
 site/news/spark-1-2-2-released.html |   1 +
 site/news/spark-1-3-0-released.html |   1 +
 site/news/spark-1-4-0-released.html |   1 +
 site/news/spark-1-4-1-released.html |   1 +
 site/news/spark-1-5-0-released.html |   1 +
 site/news/spark-1-5-1-released.html |   1 +
 site/news/spark-1-5-2-released.html |   1 +
 site/news/spark-1-6-0-released.html |   1 +
 site/news/spark-1-6-1-released.html |   1 +
 site/news/spark-1-6-2-released.html |   1 +
 site/news/spark-1-6-3-released.html |   1 +
 site/news/spark-2-0-0-released.html |   1 +
 site/news/spark-2-0-1-released.html |   1 +
 site/news/spark-2-0-2-released.html |   1 +
 site/news/spark-2-1-0-released.html |   1 +
 site/news/spark-2-1-1-released.html |   1 +
 site/news/spark-2-1-2-released.html |   1 +
 site/news/spark-2-2-0-released.html |   1 +
 site/news/spark-2-2-1-released.html |   1 +
 site/news/spark-2-3-0-released.html |   1 +
 site/news/spark-2.0.0-preview.html  |   1 +
 .../spark-accepted-into-apache-incubator.html   |   1 +
 site/news/spark-and-shark-in-the-news.html  |   1 +
 site/news/spark-becomes-tlp.html|   1 +
 

[1/2] spark-website git commit: Squashed commit of the following:

2018-03-16 Thread rxin
Repository: spark-website
Updated Branches:
  refs/heads/asf-site 8bd24fb6d -> a1d84bcbf


http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-summit-june-2016-agenda-posted.html
--
diff --git a/site/news/spark-summit-june-2016-agenda-posted.html 
b/site/news/spark-summit-june-2016-agenda-posted.html
index ce68829..7947354 100644
--- a/site/news/spark-summit-june-2016-agenda-posted.html
+++ b/site/news/spark-summit-june-2016-agenda-posted.html
@@ -123,6 +123,7 @@
  https://issues.apache.org/jira/browse/SPARK">Issue Tracker
   Powered By
   Project Committers
+  Project History
 
   
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-summit-june-2017-agenda-posted.html
--
diff --git a/site/news/spark-summit-june-2017-agenda-posted.html 
b/site/news/spark-summit-june-2017-agenda-posted.html
index 5d2df4b..e4055c3 100644
--- a/site/news/spark-summit-june-2017-agenda-posted.html
+++ b/site/news/spark-summit-june-2017-agenda-posted.html
@@ -123,6 +123,7 @@
  https://issues.apache.org/jira/browse/SPARK">Issue Tracker
   Powered By
   Project Committers
+  Project History
 
   
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-summit-june-2018-agenda-posted.html
--
diff --git a/site/news/spark-summit-june-2018-agenda-posted.html 
b/site/news/spark-summit-june-2018-agenda-posted.html
index 17c284f..9b2f739 100644
--- a/site/news/spark-summit-june-2018-agenda-posted.html
+++ b/site/news/spark-summit-june-2018-agenda-posted.html
@@ -123,6 +123,7 @@
  https://issues.apache.org/jira/browse/SPARK">Issue Tracker
   Powered By
   Project Committers
+  Project History
 
   
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-tips-from-quantifind.html
--
diff --git a/site/news/spark-tips-from-quantifind.html 
b/site/news/spark-tips-from-quantifind.html
index bfbac1d..00c71c2 100644
--- a/site/news/spark-tips-from-quantifind.html
+++ b/site/news/spark-tips-from-quantifind.html
@@ -123,6 +123,7 @@
  https://issues.apache.org/jira/browse/SPARK">Issue Tracker
   Powered By
   Project Committers
+  Project History
 
   
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-user-survey-and-powered-by-page.html
--
diff --git a/site/news/spark-user-survey-and-powered-by-page.html 
b/site/news/spark-user-survey-and-powered-by-page.html
index 67935a9..c015e5c 100644
--- a/site/news/spark-user-survey-and-powered-by-page.html
+++ b/site/news/spark-user-survey-and-powered-by-page.html
@@ -123,6 +123,7 @@
  https://issues.apache.org/jira/browse/SPARK">Issue Tracker
   Powered By
   Project Committers
+  Project History
 
   
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-version-0-6-0-released.html
--
diff --git a/site/news/spark-version-0-6-0-released.html 
b/site/news/spark-version-0-6-0-released.html
index 3f670d7..d9120b0 100644
--- a/site/news/spark-version-0-6-0-released.html
+++ b/site/news/spark-version-0-6-0-released.html
@@ -123,6 +123,7 @@
  https://issues.apache.org/jira/browse/SPARK">Issue Tracker
   Powered By
   Project Committers
+  Project History
 
   
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-wins-cloudsort-100tb-benchmark.html
--
diff --git a/site/news/spark-wins-cloudsort-100tb-benchmark.html 
b/site/news/spark-wins-cloudsort-100tb-benchmark.html
index b498034..8bef605 100644
--- a/site/news/spark-wins-cloudsort-100tb-benchmark.html
+++ b/site/news/spark-wins-cloudsort-100tb-benchmark.html
@@ -123,6 +123,7 @@
  https://issues.apache.org/jira/browse/SPARK">Issue Tracker
   Powered By
   Project Committers
+  Project History
 
   
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-wins-daytona-gray-sort-100tb-benchmark.html
--
diff --git a/site/news/spark-wins-daytona-gray-sort-100tb-benchmark.html 
b/site/news/spark-wins-daytona-gray-sort-100tb-benchmark.html
index 18646f4..32f53e9 100644
--- 

spark git commit: [SPARK-18371][STREAMING] Spark Streaming backpressure generates batch with large number of records

2018-03-16 Thread koeninger
Repository: spark
Updated Branches:
  refs/heads/master 5414abca4 -> dffeac369


[SPARK-18371][STREAMING] Spark Streaming backpressure generates batch with 
large number of records

## What changes were proposed in this pull request?

Omit rounding of the backpressure rate. Effects:
- no batch with a large number of records is created when the rate from the PID 
estimator is one (see the toy calculation below)
- the number of records per batch and partition is more fine-grained, improving 
backpressure accuracy
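
A toy calculation of the failure mode, with made-up numbers, mirroring the 
formulas in the diff below: with a PID rate of 1 record/sec spread over 4 
partitions, the old per-partition rate rounds to 0, the sum becomes 0, and the 
rate limit is dropped entirely, so the next batch pulls the whole backlog; 
keeping the fractional rate avoids this.

```scala
val rate = 1L                                          // records/sec from the PID estimator
val lagPerPartition = Seq(1000L, 1000L, 1000L, 1000L)  // current lag per partition
val totalLag = lagPerPartition.sum
val secsPerBatch = 60.0                                // 60s batch interval

// Old behaviour: each partition's share rounds to 0, so no limit is applied at all.
val oldRates = lagPerPartition.map(lag => Math.round(lag / totalLag.toFloat * rate))
assert(oldRates.sum == 0)

// New behaviour: fractional rates survive, and each partition gets at least 1 record.
val newLimits = lagPerPartition
  .map(lag => lag / totalLag.toDouble * rate)
  .map(r => Math.max((secsPerBatch * r).toLong, 1L))
assert(newLimits == Seq(15L, 15L, 15L, 15L))           // ~60 records for the 60s batch
```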

## How was this patch tested?

This was tested by running:
- `mvn test -pl external/kafka-0-8`
- `mvn test -pl external/kafka-0-10`
- a streaming application which was suffering from the issue

JasonMWhite

The contribution is my original work and I license the work to the project 
under the project’s open source license

Author: Sebastian Arzt 

Closes #17774 from arzt/kafka-back-pressure.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dffeac36
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dffeac36
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dffeac36

Branch: refs/heads/master
Commit: dffeac3691daa620446ae949c5b147518d128e08
Parents: 5414abc
Author: Sebastian Arzt 
Authored: Fri Mar 16 12:25:58 2018 -0500
Committer: cody koeninger 
Committed: Fri Mar 16 12:25:58 2018 -0500

--
 .../kafka010/DirectKafkaInputDStream.scala  |  6 +--
 .../kafka010/DirectKafkaStreamSuite.scala   | 48 ++
 .../kafka/DirectKafkaInputDStream.scala |  6 +--
 .../kafka/DirectKafkaStreamSuite.scala  | 51 
 4 files changed, 105 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/dffeac36/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala
--
diff --git 
a/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala
 
b/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala
index 0fa3287..9cb2448 100644
--- 
a/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala
+++ 
b/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala
@@ -138,17 +138,17 @@ private[spark] class DirectKafkaInputDStream[K, V](
 
 lagPerPartition.map { case (tp, lag) =>
   val maxRateLimitPerPartition = ppc.maxRatePerPartition(tp)
-  val backpressureRate = Math.round(lag / totalLag.toFloat * rate)
+  val backpressureRate = lag / totalLag.toDouble * rate
   tp -> (if (maxRateLimitPerPartition > 0) {
 Math.min(backpressureRate, maxRateLimitPerPartition)} else backpressureRate)
 }
-  case None => offsets.map { case (tp, offset) => tp -> ppc.maxRatePerPartition(tp) }
+  case None => offsets.map { case (tp, offset) => tp -> ppc.maxRatePerPartition(tp).toDouble }
 }
 
 if (effectiveRateLimitPerPartition.values.sum > 0) {
   val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
   Some(effectiveRateLimitPerPartition.map {
-case (tp, limit) => tp -> (secsPerBatch * limit).toLong
+case (tp, limit) => tp -> Math.max((secsPerBatch * limit).toLong, 1L)
   })
 } else {
   None

http://git-wip-us.apache.org/repos/asf/spark/blob/dffeac36/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
--
diff --git 
a/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
 
b/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
index 453b5e5..8524743 100644
--- 
a/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
+++ 
b/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
@@ -617,6 +617,54 @@ class DirectKafkaStreamSuite
 ssc.stop()
   }
 
+  test("maxMessagesPerPartition with zero offset and rate equal to one") {
+val topic = "backpressure"
+val kafkaParams = getKafkaParams()
+val batchIntervalMilliseconds = 6
+val sparkConf = new SparkConf()
+  // Safe, even with streaming, because we're using the direct API.
+  // Using 1 core is useful to make the test more predictable.
+  .setMaster("local[1]")
+  .setAppName(this.getClass.getSimpleName)
+  .set("spark.streaming.kafka.maxRatePerPartition", "100")
+
+// Setup the streaming context
+ssc = new 

svn commit: r25774 - in /dev/spark/2.3.1-SNAPSHOT-2018_03_16_10_01-21b6de4-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-03-16 Thread pwendell
Author: pwendell
Date: Fri Mar 16 17:15:31 2018
New Revision: 25774

Log:
Apache Spark 2.3.1-SNAPSHOT-2018_03_16_10_01-21b6de4 docs


[This commit notification would consist of 1443 parts, 
which exceeds the limit of 50 parts, so it was shortened to this summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-23553][TESTS] Tests should not assume the default value of `spark.sql.sources.default`

2018-03-16 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 d9e1f7040 -> 21b6de459


[SPARK-23553][TESTS] Tests should not assume the default value of 
`spark.sql.sources.default`

## What changes were proposed in this pull request?

Currently, some tests have an assumption that 
`spark.sql.sources.default=parquet`. In fact, that is a correct assumption, but 
that assumption makes it difficult to test a new data source format.

This PR aims to
- make the test suites more robust and easier to adapt to new data sources in 
the future, and
- test the new native ORC data source with the full existing Apache Spark test 
coverage.

As an example, the PR uses `spark.sql.sources.default=orc` during reviews. The 
value should be `parquet` when this PR is accepted.
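
The test pattern, as used in the diff below, is to derive the `USING` clause from 
the session configuration instead of hard-coding `parquet` (this sketch assumes a 
`SparkSession` named `spark` and the test suite's `sql` helper are in scope):

```scala
// Pick up whatever spark.sql.sources.default is set to, so the test also passes
// when the default is switched to another data source such as orc.
val provider = spark.sessionState.conf.defaultDataSourceName
sql(s"CREATE TABLE tbl(i INT, j STRING) USING $provider")
```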

## How was this patch tested?

Pass the Jenkins with updated tests.

Author: Dongjoon Hyun 

Closes #20705 from dongjoon-hyun/SPARK-23553.

(cherry picked from commit 5414abca4fec6a68174c34d22d071c20027e959d)
Signed-off-by: gatorsmile 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/21b6de45
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/21b6de45
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/21b6de45

Branch: refs/heads/branch-2.3
Commit: 21b6de4592bf8c69db9135f409912b6a62f70894
Parents: d9e1f70
Author: Dongjoon Hyun 
Authored: Fri Mar 16 09:36:30 2018 -0700
Committer: gatorsmile 
Committed: Fri Mar 16 09:36:46 2018 -0700

--
 python/pyspark/sql/readwriter.py|  4 +-
 .../org/apache/spark/sql/SQLQuerySuite.scala|  9 +--
 .../columnar/InMemoryColumnarQuerySuite.scala   |  5 +-
 .../spark/sql/execution/command/DDLSuite.scala  | 11 ++-
 .../ParquetPartitionDiscoverySuite.scala| 10 +++
 .../sql/test/DataFrameReaderWriterSuite.scala   |  3 +-
 .../sql/hive/MetastoreDataSourcesSuite.scala| 72 +---
 .../PartitionProviderCompatibilitySuite.scala   |  6 +-
 .../hive/PartitionedTablePerfStatsSuite.scala   |  2 +-
 .../spark/sql/hive/execution/HiveDDLSuite.scala | 11 +--
 .../sql/hive/execution/SQLQuerySuite.scala  | 19 ++
 11 files changed, 81 insertions(+), 71 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/21b6de45/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 9d05ac7..e70aa9e 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -147,8 +147,8 @@ class DataFrameReader(OptionUtils):
or a DDL-formatted string (For example ``col0 INT, col1 
DOUBLE``).
 :param options: all other string options
 
->>> df = spark.read.load('python/test_support/sql/parquet_partitioned', opt1=True,
-... opt2=1, opt3='str')
+>>> df = spark.read.format("parquet").load('python/test_support/sql/parquet_partitioned',
+... opt1=True, opt2=1, opt3='str')
 >>> df.dtypes
 [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
 

http://git-wip-us.apache.org/repos/asf/spark/blob/21b6de45/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
index 97d03b2..ebebf62 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -2148,7 +2148,8 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 
   test("data source table created in InMemoryCatalog should be able to 
read/write") {
 withTable("tbl") {
-  sql("CREATE TABLE tbl(i INT, j STRING) USING parquet")
+  val provider = spark.sessionState.conf.defaultDataSourceName
+  sql(s"CREATE TABLE tbl(i INT, j STRING) USING $provider")
   checkAnswer(sql("SELECT i, j FROM tbl"), Nil)
 
   Seq(1 -> "a", 2 -> "b").toDF("i", 
"j").write.mode("overwrite").insertInto("tbl")
@@ -2472,9 +2473,9 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 
   test("SPARK-16975: Column-partition path starting '_' should be handled 
correctly") {
 withTempDir { dir =>
-  val parquetDir = new File(dir, "parquet").getCanonicalPath
-  spark.range(10).withColumn("_col", 
$"id").write.partitionBy("_col").save(parquetDir)
-  spark.read.parquet(parquetDir)
+  val dataDir = new File(dir, "data").getCanonicalPath
+  spark.range(10).withColumn("_col", 
$"id").write.partitionBy("_col").save(dataDir)
+  spark.read.load(dataDir)
 }
   

spark git commit: [SPARK-23553][TESTS] Tests should not assume the default value of `spark.sql.sources.default`

2018-03-16 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master c95200048 -> 5414abca4


[SPARK-23553][TESTS] Tests should not assume the default value of 
`spark.sql.sources.default`

## What changes were proposed in this pull request?

Currently, some tests have an assumption that 
`spark.sql.sources.default=parquet`. In fact, that is a correct assumption, but 
that assumption makes it difficult to test a new data source format.

This PR aims to
- make the test suites more robust and easier to adapt to new data sources in 
the future, and
- test the new native ORC data source with the full existing Apache Spark test 
coverage.

As an example, the PR uses `spark.sql.sources.default=orc` during reviews. The 
value should be `parquet` when this PR is accepted.

## How was this patch tested?

Pass the Jenkins with updated tests.

Author: Dongjoon Hyun 

Closes #20705 from dongjoon-hyun/SPARK-23553.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5414abca
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5414abca
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5414abca

Branch: refs/heads/master
Commit: 5414abca4fec6a68174c34d22d071c20027e959d
Parents: c952000
Author: Dongjoon Hyun 
Authored: Fri Mar 16 09:36:30 2018 -0700
Committer: gatorsmile 
Committed: Fri Mar 16 09:36:30 2018 -0700

--
 python/pyspark/sql/readwriter.py|  4 +-
 .../org/apache/spark/sql/SQLQuerySuite.scala|  9 +--
 .../columnar/InMemoryColumnarQuerySuite.scala   |  5 +-
 .../spark/sql/execution/command/DDLSuite.scala  | 11 ++-
 .../ParquetPartitionDiscoverySuite.scala| 10 +++
 .../sql/test/DataFrameReaderWriterSuite.scala   |  3 +-
 .../sql/hive/MetastoreDataSourcesSuite.scala| 72 +---
 .../PartitionProviderCompatibilitySuite.scala   |  6 +-
 .../hive/PartitionedTablePerfStatsSuite.scala   |  2 +-
 .../spark/sql/hive/execution/HiveDDLSuite.scala | 11 +--
 .../sql/hive/execution/SQLQuerySuite.scala  | 19 ++
 11 files changed, 81 insertions(+), 71 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5414abca/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 803f561..facc16b 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -147,8 +147,8 @@ class DataFrameReader(OptionUtils):
or a DDL-formatted string (For example ``col0 INT, col1 
DOUBLE``).
 :param options: all other string options
 
->>> df = spark.read.load('python/test_support/sql/parquet_partitioned', opt1=True,
-... opt2=1, opt3='str')
+>>> df = spark.read.format("parquet").load('python/test_support/sql/parquet_partitioned',
+... opt1=True, opt2=1, opt3='str')
 >>> df.dtypes
 [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
 

http://git-wip-us.apache.org/repos/asf/spark/blob/5414abca/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
index 8f14575..640affc 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -2150,7 +2150,8 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 
   test("data source table created in InMemoryCatalog should be able to 
read/write") {
 withTable("tbl") {
-  sql("CREATE TABLE tbl(i INT, j STRING) USING parquet")
+  val provider = spark.sessionState.conf.defaultDataSourceName
+  sql(s"CREATE TABLE tbl(i INT, j STRING) USING $provider")
   checkAnswer(sql("SELECT i, j FROM tbl"), Nil)
 
   Seq(1 -> "a", 2 -> "b").toDF("i", 
"j").write.mode("overwrite").insertInto("tbl")
@@ -2474,9 +2475,9 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 
   test("SPARK-16975: Column-partition path starting '_' should be handled 
correctly") {
 withTempDir { dir =>
-  val parquetDir = new File(dir, "parquet").getCanonicalPath
-  spark.range(10).withColumn("_col", 
$"id").write.partitionBy("_col").save(parquetDir)
-  spark.read.parquet(parquetDir)
+  val dataDir = new File(dir, "data").getCanonicalPath
+  spark.range(10).withColumn("_col", 
$"id").write.partitionBy("_col").save(dataDir)
+  spark.read.load(dataDir)
 }
   }
 


svn commit: r25762 - in /dev/spark/2.4.0-SNAPSHOT-2018_03_16_04_01-c952000-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-03-16 Thread pwendell
Author: pwendell
Date: Fri Mar 16 11:20:56 2018
New Revision: 25762

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_03_16_04_01-c952000 docs


[This commit notification would consist of 1448 parts, 
which exceeds the limit of 50 parts, so it was shortened to this summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-23635][YARN] AM env variable should not overwrite same name env variable set through spark.executorEnv.

2018-03-16 Thread jshao
Repository: spark
Updated Branches:
  refs/heads/master ca83526de -> c95200048


[SPARK-23635][YARN] AM env variable should not overwrite same name env variable 
set through spark.executorEnv.

## What changes were proposed in this pull request?

In the current Spark on YARN code, the AM will always copy its env variables to 
executors and overwrite any values set there, so we cannot set different values 
for executors.

To reproduce the issue, a user could start spark-shell like:

```
./bin/spark-shell --master yarn-client --conf spark.executorEnv.SPARK_ABC=executor_val --conf spark.yarn.appMasterEnv.SPARK_ABC=am_val
```

Then check executor env variables by

```
sc.parallelize(1 to 1).flatMap { i => sys.env.toSeq }.collect.foreach(println)
```

We will always get `am_val` instead of `executor_val`. So the AM should not be 
allowed to overwrite explicitly set executor env variables.
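
A condensed sketch of the ordering this patch establishes (see the diff below): 
the AM's SPARK_* environment is copied first, and `spark.executorEnv.*` values 
are applied afterwards, so explicitly configured executor values win.

```scala
import scala.collection.mutable

// Toy model of the new precedence; the real logic lives in ExecutorRunnable.
val env = new mutable.HashMap[String, String]()

// 1. Environment inherited from the AM (SPARK_* variables only).
Map("SPARK_ABC" -> "am_val").foreach { case (k, v) => env(k) = v }

// 2. spark.executorEnv.* is applied afterwards and overwrites the AM value.
Map("SPARK_ABC" -> "executor_val").foreach { case (k, v) => env(k) = v }

assert(env("SPARK_ABC") == "executor_val")
```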

## How was this patch tested?

Added UT and tested in local cluster.

Author: jerryshao 

Closes #20799 from jerryshao/SPARK-23635.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c9520004
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c9520004
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c9520004

Branch: refs/heads/master
Commit: c952000487ee003200221b3c4e25dcb06e359f0a
Parents: ca83526
Author: jerryshao 
Authored: Fri Mar 16 16:22:03 2018 +0800
Committer: jerryshao 
Committed: Fri Mar 16 16:22:03 2018 +0800

--
 .../spark/deploy/yarn/ExecutorRunnable.scala| 22 +++-
 .../spark/deploy/yarn/YarnClusterSuite.scala| 36 
 2 files changed, 50 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c9520004/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala
--
diff --git 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala
 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala
index 3f4d236..ab08698 100644
--- 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala
+++ 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala
@@ -220,12 +220,6 @@ private[yarn] class ExecutorRunnable(
 val env = new HashMap[String, String]()
 Client.populateClasspath(null, conf, sparkConf, env, 
sparkConf.get(EXECUTOR_CLASS_PATH))
 
-sparkConf.getExecutorEnv.foreach { case (key, value) =>
-  // This assumes each executor environment variable set here is a path
-  // This is kept for backward compatibility and consistency with hadoop
-  YarnSparkHadoopUtil.addPathToEnvironment(env, key, value)
-}
-
 // lookup appropriate http scheme for container log urls
 val yarnHttpPolicy = conf.get(
   YarnConfiguration.YARN_HTTP_POLICY_KEY,
@@ -233,6 +227,20 @@ private[yarn] class ExecutorRunnable(
 )
 val httpScheme = if (yarnHttpPolicy == "HTTPS_ONLY") "https://" else "http://"
 
+System.getenv().asScala.filterKeys(_.startsWith("SPARK"))
+  .foreach { case (k, v) => env(k) = v }
+
+sparkConf.getExecutorEnv.foreach { case (key, value) =>
+  if (key == Environment.CLASSPATH.name()) {
+// If the key of env variable is CLASSPATH, we assume it is a path and append it.
+// This is kept for backward compatibility and consistency with hadoop
+YarnSparkHadoopUtil.addPathToEnvironment(env, key, value)
+  } else {
+// For other env variables, simply overwrite the value.
+env(key) = value
+  }
+}
+
 // Add log urls
 container.foreach { c =>
   sys.env.get("SPARK_USER").foreach { user =>
@@ -245,8 +253,6 @@ private[yarn] class ExecutorRunnable(
   }
 }
 
-System.getenv().asScala.filterKeys(_.startsWith("SPARK"))
-  .foreach { case (k, v) => env(k) = v }
 env
   }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/c9520004/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala
--
diff --git 
a/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala
 
b/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala
index 33d400a..a129be7 100644
--- 
a/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala
+++ 
b/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala
@@ -225,6 +225,14 @@ class YarnClusterSuite extends BaseYarnClusterSuite {
 finalState should be (SparkAppHandle.State.FAILED)
   }
 
+  

spark git commit: [SPARK-23644][CORE][UI] Use absolute path for REST call in SHS

2018-03-16 Thread jshao
Repository: spark
Updated Branches:
  refs/heads/master c2632edeb -> ca83526de


[SPARK-23644][CORE][UI] Use absolute path for REST call in SHS

## What changes were proposed in this pull request?

The SHS uses a relative path for the REST API call that gets the list of 
applications. When the SHS is consumed through a proxy, this can be an issue if 
the path doesn't end with a "/".

Therefore, we should use an absolute path for the REST call as it is done for 
all the other resources.

## How was this patch tested?

manual tests
Before the change:
![screen shot 2018-03-10 at 4 22 02 
pm](https://user-images.githubusercontent.com/8821783/37244190-8ccf9d40-2485-11e8-8fa9-345bc81472fc.png)

After the change:
![screen shot 2018-03-10 at 4 36 34 pm 
1](https://user-images.githubusercontent.com/8821783/37244201-a1922810-2485-11e8-8856-eeab2bf5e180.png)

Author: Marco Gaido 

Closes #20794 from mgaido91/SPARK-23644.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ca83526d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ca83526d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ca83526d

Branch: refs/heads/master
Commit: ca83526de55f0f8784df58cc8b7c0a7cb0c96e23
Parents: c2632ed
Author: Marco Gaido 
Authored: Fri Mar 16 15:12:26 2018 +0800
Committer: jerryshao 
Committed: Fri Mar 16 15:12:26 2018 +0800

--
 .../src/main/resources/org/apache/spark/ui/static/historypage.js | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ca83526d/core/src/main/resources/org/apache/spark/ui/static/historypage.js
--
diff --git a/core/src/main/resources/org/apache/spark/ui/static/historypage.js 
b/core/src/main/resources/org/apache/spark/ui/static/historypage.js
index f0b2a5a..abc2ec0 100644
--- a/core/src/main/resources/org/apache/spark/ui/static/historypage.js
+++ b/core/src/main/resources/org/apache/spark/ui/static/historypage.js
@@ -113,7 +113,7 @@ $(document).ready(function() {
   status: (requestedIncomplete ? "running" : "completed")
 };
 
-$.getJSON("api/v1/applications", appParams, 
function(response,status,jqXHR) {
+$.getJSON(uiRoot + "/api/v1/applications", appParams, 
function(response,status,jqXHR) {
   var array = [];
   var hasMultipleAttempts = false;
   for (i in response) {
@@ -151,7 +151,7 @@ $(document).ready(function() {
 "showCompletedColumns": !requestedIncomplete,
   }
 
-  $.get("static/historypage-template.html", function(template) {
+  $.get(uiRoot + "/static/historypage-template.html", function(template) {
 var sibling = historySummary.prev();
 historySummary.detach();
var apps = $(Mustache.render($(template).filter("#history-summary-template").html(),data));


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org