svn commit: r24654 - in /dev/spark/2.3.1-SNAPSHOT-2018_02_02_22_01-b614c08-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-02-02 Thread pwendell
Author: pwendell
Date: Sat Feb  3 06:14:59 2018
New Revision: 24654

Log:
Apache Spark 2.3.1-SNAPSHOT-2018_02_02_22_01-b614c08 docs


[This commit notification would consist of 1442 parts, 
which exceeds the limit of 50 parts, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-23317][SQL] rename ContinuousReader.setOffset to setStartOffset

2018-02-02 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master 3ff83ad43 -> fe73cb4b4


[SPARK-23317][SQL] rename ContinuousReader.setOffset to setStartOffset

## What changes were proposed in this pull request?

In the documentation of `ContinuousReader.setOffset`, we say this method is used to 
specify the start offset. We also have a `ContinuousReader.getStartOffset` to 
get the value back. I think it makes more sense to rename 
`ContinuousReader.setOffset` to `setStartOffset`.

## How was this patch tested?

N/A

Author: Wenchen Fan 

Closes #20486 from cloud-fan/rename.
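
As a side note for readers of this change, here is a minimal self-contained sketch (not code from the patch; `StartOffsetHolder`, `OffsetLike`, and the default offset are hypothetical stand-ins) of the setStartOffset/getStartOffset contract described in the ContinuousReader javadoc:

```scala
import java.util.Optional

// Toy stand-in mirroring the renamed API's contract: the setter records an explicit start
// offset or falls back to an inferred default, and the getter fails if the setter was never
// called. OffsetLike and defaultOffset are placeholders, not Spark types.
class StartOffsetHolder[OffsetLike](defaultOffset: => OffsetLike) {
  private var startOffset: Option[OffsetLike] = None

  // Renamed from setOffset: store the provided start offset, or infer one if absent.
  def setStartOffset(start: Optional[OffsetLike]): Unit =
    startOffset = Some(if (start.isPresent) start.get else defaultOffset)

  // Per the javadoc: throws IllegalStateException if setStartOffset has not been called.
  def getStartOffset: OffsetLike =
    startOffset.getOrElse(
      throw new IllegalStateException("setStartOffset has not been called"))
}

// Example usage:
//   val holder = new StartOffsetHolder[Long](0L)
//   holder.setStartOffset(Optional.empty())   // falls back to the inferred/default offset
//   holder.getStartOffset                     // == 0L
```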


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fe73cb4b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fe73cb4b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fe73cb4b

Branch: refs/heads/master
Commit: fe73cb4b439169f16cc24cd851a11fd398ce7edf
Parents: 3ff83ad
Author: Wenchen Fan 
Authored: Fri Feb 2 20:49:08 2018 -0800
Committer: gatorsmile 
Committed: Fri Feb 2 20:49:08 2018 -0800

--
 .../org/apache/spark/sql/kafka010/KafkaContinuousReader.scala| 2 +-
 .../spark/sql/sources/v2/reader/streaming/ContinuousReader.java  | 4 ++--
 .../sql/execution/streaming/continuous/ContinuousExecution.scala | 2 +-
 .../streaming/continuous/ContinuousRateStreamSource.scala| 2 +-
 .../apache/spark/sql/execution/streaming/RateSourceV2Suite.scala | 2 +-
 .../spark/sql/streaming/sources/StreamingDataSourceV2Suite.scala | 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fe73cb4b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
--
diff --git 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
index 41c443b..b049a05 100644
--- 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
+++ 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
@@ -71,7 +71,7 @@ class KafkaContinuousReader(
   override def readSchema: StructType = KafkaOffsetReader.kafkaSchema
 
   private var offset: Offset = _
-  override def setOffset(start: ju.Optional[Offset]): Unit = {
+  override def setStartOffset(start: ju.Optional[Offset]): Unit = {
 offset = start.orElse {
   val offsets = initialOffsets match {
 case EarliestOffsetRangeLimit => 
KafkaSourceOffset(offsetReader.fetchEarliestOffsets())

http://git-wip-us.apache.org/repos/asf/spark/blob/fe73cb4b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java
 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java
index d1d1e7f..7fe7f00 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java
@@ -51,12 +51,12 @@ public interface ContinuousReader extends 
BaseStreamingSource, DataSourceReader
  * start from the first record after the provided offset, or from an 
implementation-defined
  * inferred starting point if no offset is provided.
  */
-void setOffset(Optional<Offset> start);
+void setStartOffset(Optional<Offset> start);
 
 /**
  * Return the specified or inferred start offset for this reader.
  *
- * @throws IllegalStateException if setOffset has not been called
+ * @throws IllegalStateException if setStartOffset has not been called
  */
 Offset getStartOffset();
 

http://git-wip-us.apache.org/repos/asf/spark/blob/fe73cb4b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala
index 08c8141..ed22b91 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala
@@ -181,7 +181,7 @@ class ContinuousExecution(
 
 val loggedOffset = offsets.offsets(0)
 val realOffset = 

spark git commit: [SPARK-23317][SQL] rename ContinuousReader.setOffset to setStartOffset

2018-02-02 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 dcd0af4be -> b614c083a


[SPARK-23317][SQL] rename ContinuousReader.setOffset to setStartOffset

## What changes were proposed in this pull request?

In the documentation of `ContinuousReader.setOffset`, we say this method is used to 
specify the start offset. We also have a `ContinuousReader.getStartOffset` to 
get the value back. I think it makes more sense to rename 
`ContinuousReader.setOffset` to `setStartOffset`.

## How was this patch tested?

N/A

Author: Wenchen Fan 

Closes #20486 from cloud-fan/rename.

(cherry picked from commit fe73cb4b439169f16cc24cd851a11fd398ce7edf)
Signed-off-by: gatorsmile 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b614c083
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b614c083
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b614c083

Branch: refs/heads/branch-2.3
Commit: b614c083a4875c874180a93b08ea5031fa90cfec
Parents: dcd0af4
Author: Wenchen Fan 
Authored: Fri Feb 2 20:49:08 2018 -0800
Committer: gatorsmile 
Committed: Fri Feb 2 20:49:17 2018 -0800

--
 .../org/apache/spark/sql/kafka010/KafkaContinuousReader.scala| 2 +-
 .../spark/sql/sources/v2/reader/streaming/ContinuousReader.java  | 4 ++--
 .../sql/execution/streaming/continuous/ContinuousExecution.scala | 2 +-
 .../streaming/continuous/ContinuousRateStreamSource.scala| 2 +-
 .../apache/spark/sql/execution/streaming/RateSourceV2Suite.scala | 2 +-
 .../spark/sql/streaming/sources/StreamingDataSourceV2Suite.scala | 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b614c083/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
--
diff --git 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
index 41c443b..b049a05 100644
--- 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
+++ 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
@@ -71,7 +71,7 @@ class KafkaContinuousReader(
   override def readSchema: StructType = KafkaOffsetReader.kafkaSchema
 
   private var offset: Offset = _
-  override def setOffset(start: ju.Optional[Offset]): Unit = {
+  override def setStartOffset(start: ju.Optional[Offset]): Unit = {
 offset = start.orElse {
   val offsets = initialOffsets match {
 case EarliestOffsetRangeLimit => 
KafkaSourceOffset(offsetReader.fetchEarliestOffsets())

http://git-wip-us.apache.org/repos/asf/spark/blob/b614c083/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java
 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java
index d1d1e7f..7fe7f00 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java
@@ -51,12 +51,12 @@ public interface ContinuousReader extends 
BaseStreamingSource, DataSourceReader
  * start from the first record after the provided offset, or from an 
implementation-defined
  * inferred starting point if no offset is provided.
  */
-void setOffset(Optional<Offset> start);
+void setStartOffset(Optional<Offset> start);
 
 /**
  * Return the specified or inferred start offset for this reader.
  *
- * @throws IllegalStateException if setOffset has not been called
+ * @throws IllegalStateException if setStartOffset has not been called
  */
 Offset getStartOffset();
 

http://git-wip-us.apache.org/repos/asf/spark/blob/b614c083/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala
index 08c8141..ed22b91 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala
+++ 

spark git commit: [SQL] Minor doc update: Add an example in DataFrameReader.schema

2018-02-02 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master eaf35de24 -> 3ff83ad43


[SQL] Minor doc update: Add an example in DataFrameReader.schema

## What changes were proposed in this pull request?
This patch adds a small example to the documentation of the schema-string variant of the 
`schema` function. It isn't obvious how to use it, so an example is useful.

## How was this patch tested?
N/A - doc only.

Author: Reynold Xin 

Closes #20491 from rxin/schema-doc.
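
As a quick illustration of the DDL-string form being documented (assumes an existing SparkSession named `spark`; the file path and column names are placeholders, not from the patch):

```scala
import org.apache.spark.sql.types._

// Schema given as a DDL-formatted string, so the CSV reader can skip schema inference.
val df = spark.read.schema("a INT, b STRING, c DOUBLE").csv("test.csv")

// The equivalent programmatic form using StructType, for comparison.
val schema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", StringType),
  StructField("c", DoubleType)))
val df2 = spark.read.schema(schema).csv("test.csv")
```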


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3ff83ad4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3ff83ad4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3ff83ad4

Branch: refs/heads/master
Commit: 3ff83ad43a704cc3354ef9783e711c065e2a1a22
Parents: eaf35de
Author: Reynold Xin 
Authored: Fri Feb 2 20:36:27 2018 -0800
Committer: gatorsmile 
Committed: Fri Feb 2 20:36:27 2018 -0800

--
 .../src/main/scala/org/apache/spark/sql/DataFrameReader.scala| 4 
 1 file changed, 4 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3ff83ad4/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index 46b5f54..fcaf8d6 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -74,6 +74,10 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* infer the input schema automatically from data. By specifying the schema 
here, the underlying
* data source can skip the schema inference step, and thus speed up data 
loading.
*
+   * {{{
+   *   spark.read.schema("a INT, b STRING, c DOUBLE").csv("test.csv")
+   * }}}
+   *
* @since 2.3.0
*/
   def schema(schemaString: String): DataFrameReader = {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SQL] Minor doc update: Add an example in DataFrameReader.schema

2018-02-02 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 56eb9a310 -> dcd0af4be


[SQL] Minor doc update: Add an example in DataFrameReader.schema

## What changes were proposed in this pull request?
This patch adds a small example to the documentation of the schema-string variant of the 
`schema` function. It isn't obvious how to use it, so an example is useful.

## How was this patch tested?
N/A - doc only.

Author: Reynold Xin 

Closes #20491 from rxin/schema-doc.

(cherry picked from commit 3ff83ad43a704cc3354ef9783e711c065e2a1a22)
Signed-off-by: gatorsmile 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dcd0af4b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dcd0af4b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dcd0af4b

Branch: refs/heads/branch-2.3
Commit: dcd0af4be752ab61b8caf36f70d98e97c6925473
Parents: 56eb9a3
Author: Reynold Xin 
Authored: Fri Feb 2 20:36:27 2018 -0800
Committer: gatorsmile 
Committed: Fri Feb 2 20:36:37 2018 -0800

--
 .../src/main/scala/org/apache/spark/sql/DataFrameReader.scala| 4 
 1 file changed, 4 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/dcd0af4b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index 46b5f54..fcaf8d6 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -74,6 +74,10 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* infer the input schema automatically from data. By specifying the schema 
here, the underlying
* data source can skip the schema inference step, and thus speed up data 
loading.
*
+   * {{{
+   *   spark.read.schema("a INT, b STRING, c DOUBLE").csv("test.csv")
+   * }}}
+   *
* @since 2.3.0
*/
   def schema(schemaString: String): DataFrameReader = {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r24653 - in /dev/spark/2.4.0-SNAPSHOT-2018_02_02_20_01-eaf35de-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-02-02 Thread pwendell
Author: pwendell
Date: Sat Feb  3 04:15:28 2018
New Revision: 24653

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_02_02_20_01-eaf35de docs


[This commit notification would consist of 1444 parts, 
which exceeds the limit of 50 parts, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r24652 - in /dev/spark/2.3.1-SNAPSHOT-2018_02_02_18_01-56eb9a3-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-02-02 Thread pwendell
Author: pwendell
Date: Sat Feb  3 02:15:06 2018
New Revision: 24652

Log:
Apache Spark 2.3.1-SNAPSHOT-2018_02_02_18_01-56eb9a3 docs


[This commit notification would consist of 1442 parts, 
which exceeds the limit of 50 parts, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-23064][SS][DOCS] Stream-stream joins Documentation - follow up

2018-02-02 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 e5e9f9a43 -> 56eb9a310


[SPARK-23064][SS][DOCS] Stream-stream joins Documentation - follow up

## What changes were proposed in this pull request?
Further clarification of caveats in using stream-stream outer joins.

## How was this patch tested?
N/A

Author: Tathagata Das 

Closes #20494 from tdas/SPARK-23064-2.

(cherry picked from commit eaf35de2471fac4337dd2920026836d52b1ec847)
Signed-off-by: Tathagata Das 
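
For readers following the caveats above, a hedged Scala sketch of the kind of watermarked stream-stream left outer join they apply to (adapted from the programming guide; `impressions` and `clicks` are assumed to be existing streaming DataFrames, and the delays and join condition are illustrative):

```scala
import org.apache.spark.sql.functions.expr

// Event-time watermarks on both input streams; the delays bound how long join state is kept.
val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

// Left outer join with a time-range condition. Per the caveats above, outer NULL rows for
// unmatched impressions only appear once the watermark plus the time-range bound has passed,
// and only when a new micro-batch is triggered by incoming data on either stream.
val joined = impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """),
  joinType = "leftOuter")
```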


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/56eb9a31
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/56eb9a31
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/56eb9a31

Branch: refs/heads/branch-2.3
Commit: 56eb9a310217a5372bdba1e24e4af0d4de1829ca
Parents: e5e9f9a
Author: Tathagata Das 
Authored: Fri Feb 2 17:37:51 2018 -0800
Committer: Tathagata Das 
Committed: Fri Feb 2 17:38:07 2018 -0800

--
 docs/structured-streaming-programming-guide.md | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/56eb9a31/docs/structured-streaming-programming-guide.md
--
diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index 62589a6..48d6d0b 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -1346,10 +1346,20 @@ joined <- join(
 
 
 
-However, note that the outer NULL results will be generated with a delay 
(depends on the specified
-watermark delay and the time range condition) because the engine has to wait 
for that long to ensure
+
+There are a few points to note regarding outer joins.
+
+- *The outer NULL results will be generated with a delay that depends on the 
specified watermark
+delay and the time range condition.* This is because the engine has to wait 
for that long to ensure
 there were no matches and there will be no more matches in future.
 
+- In the current implementation in the micro-batch engine, watermarks are 
advanced at the end of a
+micro-batch, and the next micro-batch uses the updated watermark to clean up 
state and output
+outer results. Since we trigger a micro-batch only when there is new data to 
be processed, the
+generation of the outer result may get delayed if there no new data being 
received in the stream.
+*In short, if any of the two input streams being joined does not receive data 
for a while, the
+outer (both cases, left or right) output may get delayed.*
+
 # Support matrix for joins in streaming queries
 
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-23064][SS][DOCS] Stream-stream joins Documentation - follow up

2018-02-02 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/master eefec93d1 -> eaf35de24


[SPARK-23064][SS][DOCS] Stream-stream joins Documentation - follow up

## What changes were proposed in this pull request?
Further clarification of caveats in using stream-stream outer joins.

## How was this patch tested?
N/A

Author: Tathagata Das 

Closes #20494 from tdas/SPARK-23064-2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eaf35de2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eaf35de2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eaf35de2

Branch: refs/heads/master
Commit: eaf35de2471fac4337dd2920026836d52b1ec847
Parents: eefec93
Author: Tathagata Das 
Authored: Fri Feb 2 17:37:51 2018 -0800
Committer: Tathagata Das 
Committed: Fri Feb 2 17:37:51 2018 -0800

--
 docs/structured-streaming-programming-guide.md | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/eaf35de2/docs/structured-streaming-programming-guide.md
--
diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index 62589a6..48d6d0b 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -1346,10 +1346,20 @@ joined <- join(
 
 
 
-However, note that the outer NULL results will be generated with a delay 
(depends on the specified
-watermark delay and the time range condition) because the engine has to wait 
for that long to ensure
+
+There are a few points to note regarding outer joins.
+
+- *The outer NULL results will be generated with a delay that depends on the 
specified watermark
+delay and the time range condition.* This is because the engine has to wait 
for that long to ensure
 there were no matches and there will be no more matches in future.
 
+- In the current implementation in the micro-batch engine, watermarks are 
advanced at the end of a
+micro-batch, and the next micro-batch uses the updated watermark to clean up 
state and output
+outer results. Since we trigger a micro-batch only when there is new data to 
be processed, the
+generation of the outer result may get delayed if there no new data being 
received in the stream.
+*In short, if any of the two input streams being joined does not receive data 
for a while, the
+outer (both cases, left or right) output may get delayed.*
+
 # Support matrix for joins in streaming queries
 
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r24649 - in /dev/spark/2.4.0-SNAPSHOT-2018_02_02_12_01-eefec93-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-02-02 Thread pwendell
Author: pwendell
Date: Fri Feb  2 20:15:48 2018
New Revision: 24649

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_02_02_12_01-eefec93 docs


[This commit notification would consist of 1444 parts, 
which exceeds the limit of 50 parts, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r24648 - in /dev/spark/2.3.1-SNAPSHOT-2018_02_02_10_01-e5e9f9a-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-02-02 Thread pwendell
Author: pwendell
Date: Fri Feb  2 18:15:12 2018
New Revision: 24648

Log:
Apache Spark 2.3.1-SNAPSHOT-2018_02_02_10_01-e5e9f9a docs


[This commit notification would consist of 1442 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-23295][BUILD][MINOR] Exclude Warning message when generating versions in make-distribution.sh

2018-02-02 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master dd52681bf -> eefec93d1


[SPARK-23295][BUILD][MINOR] Exclude Warning message when generating versions in 
make-distribution.sh

## What changes were proposed in this pull request?

When we specify a wrong profile while making a Spark distribution, such as 
`-Phadoop1000`, we get an oddly named package like `spark-[WARNING] The 
requested profile "hadoop1000" could not be activated because it does not 
exist.-bin-hadoop-2.7.tgz`, which should actually be 
`"spark-$VERSION-bin-$NAME.tgz"`

## How was this patch tested?
### before
```
build/mvn help:evaluate -Dexpression=scala.binary.version -Phadoop1000 
2>/dev/null | grep -v "INFO" | tail -n 1
[WARNING] The requested profile "hadoop1000" could not be activated because it 
does not exist.
```
```
build/mvn help:evaluate -Dexpression=project.version -Phadoop1000 2>/dev/null | 
grep -v "INFO" | tail -n 1
[WARNING] The requested profile "hadoop1000" could not be activated because it 
does not exist.
```
### after
```
 build/mvn help:evaluate -Dexpression=project.version -Phadoop1000 2>/dev/null 
| grep  -v "INFO" | grep -v "WARNING" | tail -n 1
2.4.0-SNAPSHOT
```
```
build/mvn help:evaluate -Dexpression=scala.binary.version 
-Dscala.binary.version=2.11.1 2>/dev/null | grep  -v "INFO" | grep -v "WARNING" 
| tail -n 1
2.11.1
```

cloud-fan srowen

Author: Kent Yao 

Closes #20469 from yaooqinn/dist-minor.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eefec93d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eefec93d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eefec93d

Branch: refs/heads/master
Commit: eefec93d193d43d5b71b8f8a4b1060286da971dd
Parents: dd52681
Author: Kent Yao 
Authored: Fri Feb 2 10:17:51 2018 -0600
Committer: Sean Owen 
Committed: Fri Feb 2 10:17:51 2018 -0600

--
 dev/make-distribution.sh | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/eefec93d/dev/make-distribution.sh
--
diff --git a/dev/make-distribution.sh b/dev/make-distribution.sh
index 7245163..8b02446 100755
--- a/dev/make-distribution.sh
+++ b/dev/make-distribution.sh
@@ -117,15 +117,21 @@ if [ ! "$(command -v "$MVN")" ] ; then
 exit -1;
 fi
 
-VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null | 
grep -v "INFO" | tail -n 1)
+VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null\
+| grep -v "INFO"\
+| grep -v "WARNING"\
+| tail -n 1)
 SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 
2>/dev/null\
 | grep -v "INFO"\
+| grep -v "WARNING"\
 | tail -n 1)
 SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 
2>/dev/null\
 | grep -v "INFO"\
+| grep -v "WARNING"\
 | tail -n 1)
 SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl 
sql/hive $@ 2>/dev/null\
 | grep -v "INFO"\
+| grep -v "WARNING"\
 | fgrep --count "hive";\
 # Reset exit status to 0, otherwise the script stops here if the last grep 
finds nothing\
 # because we use "set -o pipefail"


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r24645 - in /dev/spark/2.4.0-SNAPSHOT-2018_02_02_08_01-dd52681-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-02-02 Thread pwendell
Author: pwendell
Date: Fri Feb  2 16:15:39 2018
New Revision: 24645

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_02_02_08_01-dd52681 docs


[This commit notification would consist of 1444 parts, 
which exceeds the limit of 50 parts, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-23253][CORE][SHUFFLE] Only write shuffle temporary index file when there is not an existing one

2018-02-02 Thread irashid
Repository: spark
Updated Branches:
  refs/heads/master b9503fcbb -> dd52681bf


[SPARK-23253][CORE][SHUFFLE] Only write shuffle temporary index file when there 
is not an existing one

## What changes were proposed in this pull request?

The shuffle index temporary file is used to create the shuffle index file atomically; 
it is not needed when the index file already exists because another attempt of the 
same task has already written it.

## How was this patch tested?

existing unit tests

cc squito

Author: Kent Yao 

Closes #20422 from yaooqinn/SPARK-23253.
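
The gist of the change, as a hedged stand-alone sketch (the method name and arguments are illustrative; the real resolver also synchronizes the check-and-rename and handles the data file):

```scala
import java.io.{BufferedOutputStream, DataOutputStream, File, FileOutputStream, IOException}

// Only create and write the temporary index file when no committed index file exists yet;
// if a previous attempt of the same task already wrote it, skip the temp file entirely.
def writeIndexIfAbsent(indexFile: File, indexTmp: File, lengths: Array[Long]): Unit = {
  if (indexFile.exists()) {
    return  // another attempt already committed the index; nothing to write
  }
  val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexTmp)))
  try {
    // Convert per-block lengths into cumulative offsets, as the resolver does.
    var offset = 0L
    out.writeLong(offset)
    for (length <- lengths) {
      offset += length
      out.writeLong(offset)
    }
  } finally {
    out.close()
  }
  // Atomically promote the temporary file to the final index file.
  if (!indexTmp.renameTo(indexFile)) {
    throw new IOException(s"fail to rename file $indexTmp to $indexFile")
  }
}
```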


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dd52681b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dd52681b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dd52681b

Branch: refs/heads/master
Commit: dd52681bf542386711609cb037a55b3d264eddef
Parents: b9503fc
Author: Kent Yao 
Authored: Fri Feb 2 09:10:50 2018 -0600
Committer: Imran Rashid 
Committed: Fri Feb 2 09:10:50 2018 -0600

--
 .../shuffle/IndexShuffleBlockResolver.scala | 27 +
 .../sort/IndexShuffleBlockResolverSuite.scala   | 59 ++--
 2 files changed, 56 insertions(+), 30 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/dd52681b/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala 
b/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
index 266ee42..c5f3f6e 100644
--- 
a/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
+++ 
b/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
@@ -141,19 +141,6 @@ private[spark] class IndexShuffleBlockResolver(
 val indexFile = getIndexFile(shuffleId, mapId)
 val indexTmp = Utils.tempFileWith(indexFile)
 try {
-  val out = new DataOutputStream(new BufferedOutputStream(new 
FileOutputStream(indexTmp)))
-  Utils.tryWithSafeFinally {
-// We take in lengths of each block, need to convert it to offsets.
-var offset = 0L
-out.writeLong(offset)
-for (length <- lengths) {
-  offset += length
-  out.writeLong(offset)
-}
-  } {
-out.close()
-  }
-
   val dataFile = getDataFile(shuffleId, mapId)
   // There is only one IndexShuffleBlockResolver per executor, this 
synchronization make sure
   // the following check and rename are atomic.
@@ -166,10 +153,22 @@ private[spark] class IndexShuffleBlockResolver(
   if (dataTmp != null && dataTmp.exists()) {
 dataTmp.delete()
   }
-  indexTmp.delete()
 } else {
   // This is the first successful attempt in writing the map outputs 
for this task,
   // so override any existing index and data files with the ones we 
wrote.
+  val out = new DataOutputStream(new BufferedOutputStream(new 
FileOutputStream(indexTmp)))
+  Utils.tryWithSafeFinally {
+// We take in lengths of each block, need to convert it to offsets.
+var offset = 0L
+out.writeLong(offset)
+for (length <- lengths) {
+  offset += length
+  out.writeLong(offset)
+}
+  } {
+out.close()
+  }
+
   if (indexFile.exists()) {
 indexFile.delete()
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/dd52681b/core/src/test/scala/org/apache/spark/shuffle/sort/IndexShuffleBlockResolverSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/shuffle/sort/IndexShuffleBlockResolverSuite.scala
 
b/core/src/test/scala/org/apache/spark/shuffle/sort/IndexShuffleBlockResolverSuite.scala
index d21ce73..4ce379b 100644
--- 
a/core/src/test/scala/org/apache/spark/shuffle/sort/IndexShuffleBlockResolverSuite.scala
+++ 
b/core/src/test/scala/org/apache/spark/shuffle/sort/IndexShuffleBlockResolverSuite.scala
@@ -17,7 +17,7 @@
 
 package org.apache.spark.shuffle.sort
 
-import java.io.{File, FileInputStream, FileOutputStream}
+import java.io.{DataInputStream, File, FileInputStream, FileOutputStream}
 
 import org.mockito.{Mock, MockitoAnnotations}
 import org.mockito.Answers.RETURNS_SMART_NULLS
@@ -64,6 +64,9 @@ class IndexShuffleBlockResolverSuite extends SparkFunSuite 
with BeforeAndAfterEa
   }
 
   test("commit shuffle files multiple times") {
+val shuffleId = 1
+val mapId = 2
+val idxName = s"shuffle_${shuffleId}_${mapId}_0.index"
 val resolver = new IndexShuffleBlockResolver(conf, blockManager)
 val lengths = 

spark git commit: [SPARK-23312][SQL] add a config to turn off vectorized cache reader

2018-02-02 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 2b07452ca -> e5e9f9a43


[SPARK-23312][SQL] add a config to turn off vectorized cache reader

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-23309 reported a performance 
regression for cached tables in Spark 2.3. While the investigation is still 
ongoing, this PR adds a conf to turn off the vectorized cache reader, to 
unblock the 2.3 release.

## How was this patch tested?

a new test

Author: Wenchen Fan 

Closes #20483 from cloud-fan/cache.

(cherry picked from commit b9503fcbb3f4a3ce263164d1f11a8e99b9ca5710)
Signed-off-by: Wenchen Fan 
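
A short usage sketch for the new flag (assumes an existing SparkSession named `spark`; the table name is only an example):

```scala
// Turn off the vectorized reader for in-memory columnar (cached) table scans while the
// SPARK-23309 regression is investigated; the default remains true.
spark.conf.set("spark.sql.inMemoryColumnarStorage.enableVectorizedReader", "false")

// Tables cached and scanned after this point fall back to the row-based cache reader.
spark.table("events").cache()
spark.table("events").count()
```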


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e5e9f9a4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e5e9f9a4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e5e9f9a4

Branch: refs/heads/branch-2.3
Commit: e5e9f9a430c827669ecfe9d5c13cc555fc89c980
Parents: 2b07452
Author: Wenchen Fan 
Authored: Fri Feb 2 22:43:28 2018 +0800
Committer: Wenchen Fan 
Committed: Fri Feb 2 22:43:51 2018 +0800

--
 .../org/apache/spark/sql/internal/SQLConf.scala  |  8 
 .../execution/columnar/InMemoryTableScanExec.scala   |  2 +-
 .../org/apache/spark/sql/CachedTableSuite.scala  | 15 +--
 3 files changed, 22 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e5e9f9a4/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 7394a0d..e498f55 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -141,6 +141,12 @@ object SQLConf {
   .booleanConf
   .createWithDefault(true)
 
+  val CACHE_VECTORIZED_READER_ENABLED =
+buildConf("spark.sql.inMemoryColumnarStorage.enableVectorizedReader")
+  .doc("Enables vectorized reader for columnar caching.")
+  .booleanConf
+  .createWithDefault(true)
+
   val COLUMN_VECTOR_OFFHEAP_ENABLED =
 buildConf("spark.sql.columnVector.offheap.enabled")
   .internal()
@@ -1256,6 +1262,8 @@ class SQLConf extends Serializable with Logging {
 
   def columnBatchSize: Int = getConf(COLUMN_BATCH_SIZE)
 
+  def cacheVectorizedReaderEnabled: Boolean = 
getConf(CACHE_VECTORIZED_READER_ENABLED)
+
   def numShufflePartitions: Int = getConf(SHUFFLE_PARTITIONS)
 
   def targetPostShuffleInputSize: Long =

http://git-wip-us.apache.org/repos/asf/spark/blob/e5e9f9a4/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
index c167f1e..e972f8b 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
@@ -54,7 +54,7 @@ case class InMemoryTableScanExec(
   override val supportsBatch: Boolean = {
 // In the initial implementation, for ease of review
 // support only primitive data types and # of fields is less than 
wholeStageMaxNumFields
-relation.schema.fields.forall(f => f.dataType match {
+conf.cacheVectorizedReaderEnabled && relation.schema.fields.forall(f => 
f.dataType match {
   case BooleanType | ByteType | ShortType | IntegerType | LongType |
FloatType | DoubleType => true
   case _ => false

http://git-wip-us.apache.org/repos/asf/spark/blob/e5e9f9a4/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala
index 72fe0f4..9f27fa0 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala
@@ -21,8 +21,6 @@ import scala.collection.mutable.HashSet
 import scala.concurrent.duration._
 import scala.language.postfixOps
 
-import org.scalatest.concurrent.Eventually._
-
 import org.apache.spark.CleanerListener
 import org.apache.spark.sql.catalyst.TableIdentifier
 import 

spark git commit: [SPARK-23312][SQL] add a config to turn off vectorized cache reader

2018-02-02 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 19c7c7ebd -> b9503fcbb


[SPARK-23312][SQL] add a config to turn off vectorized cache reader

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-23309 reported a performance 
regression for cached tables in Spark 2.3. While the investigation is still 
ongoing, this PR adds a conf to turn off the vectorized cache reader, to 
unblock the 2.3 release.

## How was this patch tested?

a new test

Author: Wenchen Fan 

Closes #20483 from cloud-fan/cache.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b9503fcb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b9503fcb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b9503fcb

Branch: refs/heads/master
Commit: b9503fcbb3f4a3ce263164d1f11a8e99b9ca5710
Parents: 19c7c7e
Author: Wenchen Fan 
Authored: Fri Feb 2 22:43:28 2018 +0800
Committer: Wenchen Fan 
Committed: Fri Feb 2 22:43:28 2018 +0800

--
 .../org/apache/spark/sql/internal/SQLConf.scala  |  8 
 .../execution/columnar/InMemoryTableScanExec.scala   |  2 +-
 .../org/apache/spark/sql/CachedTableSuite.scala  | 15 +--
 3 files changed, 22 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b9503fcb/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 90654e6..1e2501e 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -141,6 +141,12 @@ object SQLConf {
   .booleanConf
   .createWithDefault(true)
 
+  val CACHE_VECTORIZED_READER_ENABLED =
+buildConf("spark.sql.inMemoryColumnarStorage.enableVectorizedReader")
+  .doc("Enables vectorized reader for columnar caching.")
+  .booleanConf
+  .createWithDefault(true)
+
   val COLUMN_VECTOR_OFFHEAP_ENABLED =
 buildConf("spark.sql.columnVector.offheap.enabled")
   .internal()
@@ -1272,6 +1278,8 @@ class SQLConf extends Serializable with Logging {
 
   def columnBatchSize: Int = getConf(COLUMN_BATCH_SIZE)
 
+  def cacheVectorizedReaderEnabled: Boolean = 
getConf(CACHE_VECTORIZED_READER_ENABLED)
+
   def numShufflePartitions: Int = getConf(SHUFFLE_PARTITIONS)
 
   def targetPostShuffleInputSize: Long =

http://git-wip-us.apache.org/repos/asf/spark/blob/b9503fcb/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
index c167f1e..e972f8b 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
@@ -54,7 +54,7 @@ case class InMemoryTableScanExec(
   override val supportsBatch: Boolean = {
 // In the initial implementation, for ease of review
 // support only primitive data types and # of fields is less than 
wholeStageMaxNumFields
-relation.schema.fields.forall(f => f.dataType match {
+conf.cacheVectorizedReaderEnabled && relation.schema.fields.forall(f => 
f.dataType match {
   case BooleanType | ByteType | ShortType | IntegerType | LongType |
FloatType | DoubleType => true
   case _ => false

http://git-wip-us.apache.org/repos/asf/spark/blob/b9503fcb/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala
index 72fe0f4..9f27fa0 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala
@@ -21,8 +21,6 @@ import scala.collection.mutable.HashSet
 import scala.concurrent.duration._
 import scala.language.postfixOps
 
-import org.scalatest.concurrent.Eventually._
-
 import org.apache.spark.CleanerListener
 import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.catalyst.expressions.SubqueryExpression
@@ -30,6 +28,7 @@ import org.apache.spark.sql.execution.{RDDScanExec, SparkPlan}
 import 

svn commit: r24636 - in /dev/spark/2.4.0-SNAPSHOT-2018_02_02_00_01-19c7c7e-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-02-02 Thread pwendell
Author: pwendell
Date: Fri Feb  2 08:17:02 2018
New Revision: 24636

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_02_02_00_01-19c7c7e docs


[This commit notification would consist of 1444 parts, 
which exceeds the limit of 50 parts, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org