spark git commit: [SPARK-16266][SQL][STREAMING] Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming

2016-06-28 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 153c2f9ac -> f454a7f9f


[SPARK-16266][SQL][STREAMING] Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming

## What changes were proposed in this pull request?

- Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming to make them consistent with the Scala packaging
- Exposed the necessary classes in sql.streaming package so that they appear in 
the docs
- Added pyspark.sql.streaming module to the docs
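
A minimal usage sketch of the new import paths (assuming a running SparkSession named `spark` and an input directory of JSON files; neither is part of this patch):

```python
# The streaming reader/writer classes now live in pyspark.sql.streaming,
# mirroring the Scala package layout; the batch classes stay in readwriter.
from pyspark.sql.readwriter import DataFrameReader, DataFrameWriter
from pyspark.sql.streaming import DataStreamReader, DataStreamWriter
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("name", StringType())])

# spark.readStream returns a DataStreamReader, spark.read a DataFrameReader.
streaming_df = spark.readStream.schema(schema).json("/tmp/streaming-input")

# df.writeStream returns a DataStreamWriter.
query = (streaming_df.writeStream
         .format("parquet")
         .option("checkpointLocation", "/tmp/checkpoint")
         .start("/tmp/streaming-output"))
```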

## How was this patch tested?
- Updated unit tests.
- Generated docs to verify the visibility of the pyspark.sql.streaming classes.

Author: Tathagata Das 

Closes #13955 from tdas/SPARK-16266.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f454a7f9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f454a7f9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f454a7f9

Branch: refs/heads/master
Commit: f454a7f9f03807dd768319798daa1351bbfc7288
Parents: 153c2f9
Author: Tathagata Das 
Authored: Tue Jun 28 22:07:11 2016 -0700
Committer: Shixiong Zhu 
Committed: Tue Jun 28 22:07:11 2016 -0700

--
 python/docs/pyspark.sql.rst  |   6 +
 python/pyspark/sql/context.py|   3 +-
 python/pyspark/sql/dataframe.py  |   3 +-
 python/pyspark/sql/readwriter.py | 493 +
 python/pyspark/sql/session.py|   3 +-
 python/pyspark/sql/streaming.py  | 502 +-
 6 files changed, 511 insertions(+), 499 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f454a7f9/python/docs/pyspark.sql.rst
--
diff --git a/python/docs/pyspark.sql.rst b/python/docs/pyspark.sql.rst
index 6259379..3be9533 100644
--- a/python/docs/pyspark.sql.rst
+++ b/python/docs/pyspark.sql.rst
@@ -21,3 +21,9 @@ pyspark.sql.functions module
 .. automodule:: pyspark.sql.functions
 :members:
 :undoc-members:
+
+pyspark.sql.streaming module
+
+.. automodule:: pyspark.sql.streaming
+:members:
+:undoc-members:

http://git-wip-us.apache.org/repos/asf/spark/blob/f454a7f9/python/pyspark/sql/context.py
--
diff --git a/python/pyspark/sql/context.py b/python/pyspark/sql/context.py
index b5dde13..3503fb9 100644
--- a/python/pyspark/sql/context.py
+++ b/python/pyspark/sql/context.py
@@ -26,7 +26,8 @@ from pyspark import since
 from pyspark.rdd import ignore_unicode_prefix
 from pyspark.sql.session import _monkey_patch_RDD, SparkSession
 from pyspark.sql.dataframe import DataFrame
-from pyspark.sql.readwriter import DataFrameReader, DataStreamReader
+from pyspark.sql.readwriter import DataFrameReader
+from pyspark.sql.streaming import DataStreamReader
 from pyspark.sql.types import Row, StringType
 from pyspark.sql.utils import install_exception_handler
 

http://git-wip-us.apache.org/repos/asf/spark/blob/f454a7f9/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 4f13307..e44b01b 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -33,7 +33,8 @@ from pyspark.storagelevel import StorageLevel
 from pyspark.traceback_utils import SCCallSiteSync
 from pyspark.sql.types import _parse_datatype_json_string
 from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column
-from pyspark.sql.readwriter import DataFrameWriter, DataStreamWriter
+from pyspark.sql.readwriter import DataFrameWriter
+from pyspark.sql.streaming import DataStreamWriter
 from pyspark.sql.types import *
 
 __all__ = ["DataFrame", "DataFrameNaFunctions", "DataFrameStatFunctions"]

http://git-wip-us.apache.org/repos/asf/spark/blob/f454a7f9/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 3f28d7a..10f307b 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -28,7 +28,7 @@ from pyspark.sql.column import _to_seq
 from pyspark.sql.types import *
 from pyspark.sql import utils
 
-__all__ = ["DataFrameReader", "DataFrameWriter", "DataStreamReader", 
"DataStreamWriter"]
+__all__ = ["DataFrameReader", "DataFrameWriter"]
 
 
 def to_str(value):
@@ -724,494 +724,6 @@ class DataFrameWriter(OptionUtils):
 self._jwrite.mode(mode).jdbc(url, table, jprop)
 
 
-class DataStreamReader(OptionUtils):
-"""
-Interface used to load a streaming :class:`DataFrame` from external 
storage systems
-(e.g. file systems, key-value stores, etc). Use :func:`spark.readStream`
-to access this.
-
-.. note:: Experimenta

[2/2] spark git commit: [SPARK-16266][SQL][STREAMING] Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming

2016-06-28 Thread zsxwing
[SPARK-16266][SQL][STREAMING] Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming

## What changes were proposed in this pull request?

- Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming to make them consistent with the Scala packaging
- Exposed the necessary classes in sql.streaming package so that they appear in 
the docs
- Added pyspark.sql.streaming module to the docs

## How was this patch tested?
- Updated unit tests.
- Generated docs to verify the visibility of the pyspark.sql.streaming classes.

Author: Tathagata Das 

Closes #13955 from tdas/SPARK-16266.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6650c053
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6650c053
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6650c053

Branch: refs/heads/branch-2.0
Commit: 6650c0533e5c60f8653d2e0a608a42d5838fa553
Parents: 345212b
Author: Tathagata Das 
Authored: Tue Jun 28 22:07:11 2016 -0700
Committer: Shixiong Zhu 
Committed: Tue Jun 28 22:17:57 2016 -0700

--
 python/docs/pyspark.sql.rst  |   6 +
 python/pyspark/sql/context.py|   3 +-
 python/pyspark/sql/dataframe.py  |   3 +-
 python/pyspark/sql/readwriter.py | 493 +
 python/pyspark/sql/session.py|   3 +-
 python/pyspark/sql/streaming.py  | 502 +-
 6 files changed, 511 insertions(+), 499 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6650c053/python/docs/pyspark.sql.rst
--
diff --git a/python/docs/pyspark.sql.rst b/python/docs/pyspark.sql.rst
index 6259379..3be9533 100644
--- a/python/docs/pyspark.sql.rst
+++ b/python/docs/pyspark.sql.rst
@@ -21,3 +21,9 @@ pyspark.sql.functions module
 .. automodule:: pyspark.sql.functions
 :members:
 :undoc-members:
+
+pyspark.sql.streaming module
+
+.. automodule:: pyspark.sql.streaming
+:members:
+:undoc-members:

http://git-wip-us.apache.org/repos/asf/spark/blob/6650c053/python/pyspark/sql/context.py
--
diff --git a/python/pyspark/sql/context.py b/python/pyspark/sql/context.py
index b5dde13..3503fb9 100644
--- a/python/pyspark/sql/context.py
+++ b/python/pyspark/sql/context.py
@@ -26,7 +26,8 @@ from pyspark import since
 from pyspark.rdd import ignore_unicode_prefix
 from pyspark.sql.session import _monkey_patch_RDD, SparkSession
 from pyspark.sql.dataframe import DataFrame
-from pyspark.sql.readwriter import DataFrameReader, DataStreamReader
+from pyspark.sql.readwriter import DataFrameReader
+from pyspark.sql.streaming import DataStreamReader
 from pyspark.sql.types import Row, StringType
 from pyspark.sql.utils import install_exception_handler
 

http://git-wip-us.apache.org/repos/asf/spark/blob/6650c053/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index c8c8e7d..e6e7029 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -33,7 +33,8 @@ from pyspark.storagelevel import StorageLevel
 from pyspark.traceback_utils import SCCallSiteSync
 from pyspark.sql.types import _parse_datatype_json_string
 from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column
-from pyspark.sql.readwriter import DataFrameWriter, DataStreamWriter
+from pyspark.sql.readwriter import DataFrameWriter
+from pyspark.sql.streaming import DataStreamWriter
 from pyspark.sql.types import *
 
 __all__ = ["DataFrame", "DataFrameNaFunctions", "DataFrameStatFunctions"]

http://git-wip-us.apache.org/repos/asf/spark/blob/6650c053/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 3f28d7a..10f307b 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -28,7 +28,7 @@ from pyspark.sql.column import _to_seq
 from pyspark.sql.types import *
 from pyspark.sql import utils
 
-__all__ = ["DataFrameReader", "DataFrameWriter", "DataStreamReader", 
"DataStreamWriter"]
+__all__ = ["DataFrameReader", "DataFrameWriter"]
 
 
 def to_str(value):
@@ -724,494 +724,6 @@ class DataFrameWriter(OptionUtils):
 self._jwrite.mode(mode).jdbc(url, table, jprop)
 
 
-class DataStreamReader(OptionUtils):
-"""
-Interface used to load a streaming :class:`DataFrame` from external 
storage systems
-(e.g. file systems, key-value stores, etc). Use :func:`spark.readStream`
-to access this.
-
-.. note:: Experimental.
-
-.. versionadded:: 2.0
-"""
-
-def __init__(self, spark):
- 

[1/2] spark git commit: [SPARK-16259][PYSPARK] Clean up options in DataFrame read/write API

2016-06-28 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 22b4072e7 -> 6650c0533


[SPARK-16259][PYSPARK] Clean up options in DataFrame read/write API

## What changes were proposed in this pull request?

There is some duplicated code for options in the DataFrame reader/writer API; this PR cleans it up and also fixes a bug with `escapeQuotes` in csv().
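
A rough sketch of the consolidated option handling (names taken from the diff below; the real `OptionUtils` delegates `option()` and `schema()` to the underlying JVM reader/writer, which is stubbed out here with a plain dict):

```python
class OptionUtils(object):
    """Sketch of the shared option helper; the real class forwards option()
    and schema() to the JVM reader/writer instead of storing them locally."""

    def __init__(self):
        self._options = {}

    def schema(self, schema):
        self._options["schema"] = schema
        return self

    def option(self, key, value):
        self._options[key] = value
        return self

    def _set_opts(self, schema=None, **options):
        # Set named options, skipping any whose value is None, so each
        # reader/writer method can pass its keyword arguments straight through.
        if schema is not None:
            self.schema(schema)
        for key, value in options.items():
            if value is not None:
                self.option(key, value)


opts = OptionUtils()
opts._set_opts(sep=",", header=True, encoding=None)  # encoding=None is dropped
print(opts._options)  # {'sep': ',', 'header': True}
```

This single helper replaces the per-format `_set_json_opts`/`_set_csv_opts` methods removed in the diff below.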

## How was this patch tested?

Existing tests.

Author: Davies Liu 

Closes #13948 from davies/csv_options.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/345212b9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/345212b9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/345212b9

Branch: refs/heads/branch-2.0
Commit: 345212b9fc91638f6cda8519ddbfec6a780854c1
Parents: 22b4072
Author: Davies Liu 
Authored: Tue Jun 28 13:43:59 2016 -0700
Committer: Shixiong Zhu 
Committed: Tue Jun 28 22:17:50 2016 -0700

--
 python/pyspark/sql/readwriter.py | 119 ++
 1 file changed, 20 insertions(+), 99 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/345212b9/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index ccbf895..3f28d7a 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -44,84 +44,20 @@ def to_str(value):
 return str(value)
 
 
-class ReaderUtils(object):
+class OptionUtils(object):
 
-def _set_json_opts(self, schema, primitivesAsString, prefersDecimal,
-   allowComments, allowUnquotedFieldNames, 
allowSingleQuotes,
-   allowNumericLeadingZero, 
allowBackslashEscapingAnyCharacter,
-   mode, columnNameOfCorruptRecord):
+def _set_opts(self, schema=None, **options):
 """
-Set options based on the Json optional parameters
+Set named options (filter out those the value is None)
 """
 if schema is not None:
 self.schema(schema)
-if primitivesAsString is not None:
-self.option("primitivesAsString", primitivesAsString)
-if prefersDecimal is not None:
-self.option("prefersDecimal", prefersDecimal)
-if allowComments is not None:
-self.option("allowComments", allowComments)
-if allowUnquotedFieldNames is not None:
-self.option("allowUnquotedFieldNames", allowUnquotedFieldNames)
-if allowSingleQuotes is not None:
-self.option("allowSingleQuotes", allowSingleQuotes)
-if allowNumericLeadingZero is not None:
-self.option("allowNumericLeadingZero", allowNumericLeadingZero)
-if allowBackslashEscapingAnyCharacter is not None:
-self.option("allowBackslashEscapingAnyCharacter", 
allowBackslashEscapingAnyCharacter)
-if mode is not None:
-self.option("mode", mode)
-if columnNameOfCorruptRecord is not None:
-self.option("columnNameOfCorruptRecord", columnNameOfCorruptRecord)
-
-def _set_csv_opts(self, schema, sep, encoding, quote, escape,
-  comment, header, inferSchema, ignoreLeadingWhiteSpace,
-  ignoreTrailingWhiteSpace, nullValue, nanValue, 
positiveInf, negativeInf,
-  dateFormat, maxColumns, maxCharsPerColumn, 
maxMalformedLogPerPartition, mode):
-"""
-Set options based on the CSV optional parameters
-"""
-if schema is not None:
-self.schema(schema)
-if sep is not None:
-self.option("sep", sep)
-if encoding is not None:
-self.option("encoding", encoding)
-if quote is not None:
-self.option("quote", quote)
-if escape is not None:
-self.option("escape", escape)
-if comment is not None:
-self.option("comment", comment)
-if header is not None:
-self.option("header", header)
-if inferSchema is not None:
-self.option("inferSchema", inferSchema)
-if ignoreLeadingWhiteSpace is not None:
-self.option("ignoreLeadingWhiteSpace", ignoreLeadingWhiteSpace)
-if ignoreTrailingWhiteSpace is not None:
-self.option("ignoreTrailingWhiteSpace", ignoreTrailingWhiteSpace)
-if nullValue is not None:
-self.option("nullValue", nullValue)
-if nanValue is not None:
-self.option("nanValue", nanValue)
-if positiveInf is not None:
-self.option("positiveInf", positiveInf)
-if negativeInf is not None:
-self.option("negativeInf", negativeInf)
-if dateFormat is not None:
-self.option("dateFormat", d

spark git commit: [SPARK-16236][SQL][FOLLOWUP] Add Path Option back to Load API in DataFrameReader

2016-06-29 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 1cde325e2 -> edd1905c0


[SPARK-16236][SQL][FOLLOWUP] Add Path Option back to Load API in DataFrameReader

## What changes were proposed in this pull request?
In the Python API, we have the same issue. Thanks for identifying this issue, zsxwing! Below is an example:
```Python
spark.read.format('json').load('python/test_support/sql/people.json')
```
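
After this change, a single string path is routed to the JVM `load(path)` overload (which, per SPARK-16236, sets it as the `path` option), while a list of paths keeps going through the `PythonUtils.toSeq` branch, as the diff below shows. A usage sketch, assuming a running SparkSession `spark`; the second file name is only illustrative:

```python
# Single path: handed to the JVM load(path) overload directly.
people = spark.read.format('json').load('python/test_support/sql/people.json')

# List of paths: converted to a Scala Seq and loaded together.
people_all = spark.read.format('json').load(
    ['python/test_support/sql/people.json', 'python/test_support/sql/people1.json'])
```
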
## How was this patch tested?
Existing test cases cover the changes in this PR.

Author: gatorsmile 

Closes #13965 from gatorsmile/optionPaths.

(cherry picked from commit 39f2eb1da34f26bf68c535c8e6b796d71a37a651)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/edd1905c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/edd1905c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/edd1905c

Branch: refs/heads/branch-2.0
Commit: edd1905c0fde69025cb6d8d8f15d13d6a6da0e3b
Parents: 1cde325
Author: gatorsmile 
Authored: Wed Jun 29 11:30:49 2016 -0700
Committer: Shixiong Zhu 
Committed: Wed Jun 29 11:30:57 2016 -0700

--
 python/pyspark/sql/readwriter.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/edd1905c/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 10f307b..44bf744 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -143,7 +143,9 @@ class DataFrameReader(OptionUtils):
 if schema is not None:
 self.schema(schema)
 self.options(**options)
-if path is not None:
+if isinstance(path, basestring):
+return self._df(self._jreader.load(path))
+elif path is not None:
 if type(path) != list:
 path = [path]
 return 
self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))





spark git commit: [SPARK-16236][SQL][FOLLOWUP] Add Path Option back to Load API in DataFrameReader

2016-06-29 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 8c9cd0a7a -> 39f2eb1da


[SPARK-16236][SQL][FOLLOWUP] Add Path Option back to Load API in DataFrameReader

## What changes were proposed in this pull request?
In the Python API, we have the same issue. Thanks for identifying this issue, zsxwing! Below is an example:
```Python
spark.read.format('json').load('python/test_support/sql/people.json')
```
## How was this patch tested?
Existing test cases cover the changes in this PR.

Author: gatorsmile 

Closes #13965 from gatorsmile/optionPaths.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/39f2eb1d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/39f2eb1d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/39f2eb1d

Branch: refs/heads/master
Commit: 39f2eb1da34f26bf68c535c8e6b796d71a37a651
Parents: 8c9cd0a
Author: gatorsmile 
Authored: Wed Jun 29 11:30:49 2016 -0700
Committer: Shixiong Zhu 
Committed: Wed Jun 29 11:30:49 2016 -0700

--
 python/pyspark/sql/readwriter.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/39f2eb1d/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 10f307b..44bf744 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -143,7 +143,9 @@ class DataFrameReader(OptionUtils):
 if schema is not None:
 self.schema(schema)
 self.options(**options)
-if path is not None:
+if isinstance(path, basestring):
+return self._df(self._jreader.load(path))
+elif path is not None:
 if type(path) != list:
 path = [path]
 return 
self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))





spark git commit: [SPARK-15591][WEBUI] Paginate Stage Table in Stages tab

2016-07-06 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 21eadd1d8 -> 478b71d02


[SPARK-15591][WEBUI] Paginate Stage Table in Stages tab

## What changes were proposed in this pull request?

This patch adds pagination support for the Stage Tables in the Stages tab. Pagination is provided for all four Stage Tables (active, pending, completed, and failed). In addition, the paged stage tables are also used in JobPage (the detail page for one job) and PoolPage.

Interactions (jumping, sorting, and setting page size) for paged tables are 
also included.

## How was this patch tested?

Tested manually by checking the Web UI after completing and failing hundreds of jobs, the same as the testing for [Paginate Job Table in Jobs tab](https://github.com/apache/spark/pull/13620).

This shows the pagination for completed stages:
![paged stage 
table](https://cloud.githubusercontent.com/assets/5558370/16125696/5804e35e-3427-11e6-8923-5c5948982648.png)

Author: Tao Lin 

Closes #13708 from nblintao/stageTable.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/478b71d0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/478b71d0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/478b71d0

Branch: refs/heads/master
Commit: 478b71d028107d42fbb6d1bd300b86efbe0dcf7d
Parents: 21eadd1
Author: Tao Lin 
Authored: Wed Jul 6 10:28:05 2016 -0700
Committer: Shixiong Zhu 
Committed: Wed Jul 6 10:28:05 2016 -0700

--
 .../scala/org/apache/spark/ui/PagedTable.scala  |   1 +
 .../apache/spark/ui/jobs/AllStagesPage.scala|  25 +-
 .../org/apache/spark/ui/jobs/JobPage.scala  |  24 +-
 .../org/apache/spark/ui/jobs/PoolPage.scala |  15 +-
 .../org/apache/spark/ui/jobs/StageTable.scala   | 517 +++
 5 files changed, 441 insertions(+), 141 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/478b71d0/core/src/main/scala/org/apache/spark/ui/PagedTable.scala
--
diff --git a/core/src/main/scala/org/apache/spark/ui/PagedTable.scala 
b/core/src/main/scala/org/apache/spark/ui/PagedTable.scala
index 9b6ed8c..2a7c16b 100644
--- a/core/src/main/scala/org/apache/spark/ui/PagedTable.scala
+++ b/core/src/main/scala/org/apache/spark/ui/PagedTable.scala
@@ -179,6 +179,7 @@ private[ui] trait PagedTable[T] {
   Splitter
 .on('&')
 .trimResults()
+.omitEmptyStrings()
 .withKeyValueSeparator("=")
 .split(querystring)
 .asScala

http://git-wip-us.apache.org/repos/asf/spark/blob/478b71d0/core/src/main/scala/org/apache/spark/ui/jobs/AllStagesPage.scala
--
diff --git a/core/src/main/scala/org/apache/spark/ui/jobs/AllStagesPage.scala 
b/core/src/main/scala/org/apache/spark/ui/jobs/AllStagesPage.scala
index e75f1c5..cba8f82 100644
--- a/core/src/main/scala/org/apache/spark/ui/jobs/AllStagesPage.scala
+++ b/core/src/main/scala/org/apache/spark/ui/jobs/AllStagesPage.scala
@@ -38,22 +38,24 @@ private[ui] class AllStagesPage(parent: StagesTab) extends 
WebUIPage("") {
   val numCompletedStages = listener.numCompletedStages
   val failedStages = listener.failedStages.reverse.toSeq
   val numFailedStages = listener.numFailedStages
-  val now = System.currentTimeMillis
+  val subPath = "stages"
 
   val activeStagesTable =
-new StageTableBase(activeStages.sortBy(_.submissionTime).reverse,
-  parent.basePath, parent.progressListener, isFairScheduler = 
parent.isFairScheduler,
-  killEnabled = parent.killEnabled)
+new StageTableBase(request, activeStages, "activeStage", 
parent.basePath, subPath,
+  parent.progressListener, parent.isFairScheduler,
+  killEnabled = parent.killEnabled, isFailedStage = false)
   val pendingStagesTable =
-new StageTableBase(pendingStages.sortBy(_.submissionTime).reverse,
-  parent.basePath, parent.progressListener, isFairScheduler = 
parent.isFairScheduler,
-  killEnabled = false)
+new StageTableBase(request, pendingStages, "pendingStage", 
parent.basePath, subPath,
+  parent.progressListener, parent.isFairScheduler,
+  killEnabled = false, isFailedStage = false)
   val completedStagesTable =
-new StageTableBase(completedStages.sortBy(_.submissionTime).reverse, 
parent.basePath,
-  parent.progressListener, isFairScheduler = parent.isFairScheduler, 
killEnabled = false)
+new StageTableBase(request, completedStages, "completedStage", 
parent.basePath, subPath,
+  parent.progressListener, parent.isFairScheduler,
+  killEnabled = false, isFailedStage = false)
   val failedStagesTable =
-new FailedStag

spark git commit: Revert "[SPARK-16372][MLLIB] Retag RDD to tallSkinnyQR of RowMatrix"

2016-07-07 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-1.6 45dda9221 -> bb92788f9


Revert "[SPARK-16372][MLLIB] Retag RDD to tallSkinnyQR of RowMatrix"

This reverts commit 45dda92214191310a56333a2085e2343eba170cd.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bb92788f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bb92788f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bb92788f

Branch: refs/heads/branch-1.6
Commit: bb92788f96426e57555ba5771e256c6425e0e75e
Parents: 45dda92
Author: Shixiong Zhu 
Authored: Thu Jul 7 10:34:50 2016 -0700
Committer: Shixiong Zhu 
Committed: Thu Jul 7 10:34:50 2016 -0700

--
 .../spark/mllib/api/python/PythonMLLibAPI.scala |  2 +-
 .../mllib/linalg/distributed/RowMatrix.scala|  2 +-
 .../linalg/distributed/JavaRowMatrixSuite.java  | 44 
 3 files changed, 2 insertions(+), 46 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bb92788f/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
index a059e38..1714983 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
@@ -1110,7 +1110,7 @@ private[python] class PythonMLLibAPI extends Serializable 
{
* Wrapper around RowMatrix constructor.
*/
   def createRowMatrix(rows: JavaRDD[Vector], numRows: Long, numCols: Int): 
RowMatrix = {
-new RowMatrix(rows.rdd, numRows, numCols)
+new RowMatrix(rows.rdd.retag(classOf[Vector]), numRows, numCols)
   }
 
   /**

http://git-wip-us.apache.org/repos/asf/spark/blob/bb92788f/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
 
b/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
index b941d1f..52c0f19 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
@@ -526,7 +526,7 @@ class RowMatrix @Since("1.0.0") (
   def tallSkinnyQR(computeQ: Boolean = false): QRDecomposition[RowMatrix, 
Matrix] = {
 val col = numCols().toInt
 // split rows horizontally into smaller matrices, and compute QR for each 
of them
-val blockQRs = rows.retag(classOf[Vector]).glom().map { partRows =>
+val blockQRs = rows.glom().map { partRows =>
   val bdm = BDM.zeros[Double](partRows.length, col)
   var i = 0
   partRows.foreach { row =>

http://git-wip-us.apache.org/repos/asf/spark/blob/bb92788f/mllib/src/test/java/org/apache/spark/mllib/linalg/distributed/JavaRowMatrixSuite.java
--
diff --git 
a/mllib/src/test/java/org/apache/spark/mllib/linalg/distributed/JavaRowMatrixSuite.java
 
b/mllib/src/test/java/org/apache/spark/mllib/linalg/distributed/JavaRowMatrixSuite.java
deleted file mode 100644
index c01af40..000
--- 
a/mllib/src/test/java/org/apache/spark/mllib/linalg/distributed/JavaRowMatrixSuite.java
+++ /dev/null
@@ -1,44 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.mllib.linalg.distributed;
-
-import java.util.Arrays;
-
-import org.junit.Test;
-
-import org.apache.spark.SharedSparkSession;
-import org.apache.spark.api.java.JavaRDD;
-import org.apache.spark.mllib.linalg.Matrix;
-import org.apache.spark.mllib.linalg.QRDecomposition;
-import org.apache.spark.mllib.linalg.Vector;
-import org.apache.spark.mllib.linalg.Vectors;
-
-public class JavaRowMatrixSuite extends SharedSparkSession {
-
-  @Test
-  public void rowMatrixQRDecomposition() {

spark git commit: [SPARK-16350][SQL] Fix support for incremental planning in writeStream.foreach()

2016-07-07 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master a04cab8f1 -> 0f7175def


[SPARK-16350][SQL] Fix support for incremental planning in writeStream.foreach()

## What changes were proposed in this pull request?

There are cases where the `complete` output mode does not output the updated aggregated value; for details, please refer to [SPARK-16350](https://issues.apache.org/jira/browse/SPARK-16350).

The cause is that, as we do `data.as[T].foreachPartition { iter => ... }` in 
`ForeachSink.addBatch()`, `foreachPartition()` does not support incremental 
planning for now.

This patch makes `foreachPartition()` support incremental planning in `ForeachSink` by making a special version of `Dataset` whose `rdd()` method supports incremental planning.

## How was this patch tested?

Added a unit test which failed before the change

Author: Liwei Lin 

Closes #14030 from lw-lin/fix-foreach-complete.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0f7175de
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0f7175de
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0f7175de

Branch: refs/heads/master
Commit: 0f7175def985a7f1e37198680f893e749612ab76
Parents: a04cab8
Author: Liwei Lin 
Authored: Thu Jul 7 10:40:42 2016 -0700
Committer: Shixiong Zhu 
Committed: Thu Jul 7 10:40:42 2016 -0700

--
 .../sql/execution/streaming/ForeachSink.scala   | 40 -
 .../streaming/IncrementalExecution.scala|  4 +-
 .../execution/streaming/ForeachSinkSuite.scala  | 86 ++--
 3 files changed, 117 insertions(+), 13 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0f7175de/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala
index 14b9b1c..082664a 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala
@@ -18,7 +18,9 @@
 package org.apache.spark.sql.execution.streaming
 
 import org.apache.spark.TaskContext
-import org.apache.spark.sql.{DataFrame, Encoder, ForeachWriter}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Encoder, ForeachWriter}
+import org.apache.spark.sql.catalyst.plans.logical.CatalystSerde
 
 /**
  * A [[Sink]] that forwards all data into [[ForeachWriter]] according to the 
contract defined by
@@ -30,7 +32,41 @@ import org.apache.spark.sql.{DataFrame, Encoder, 
ForeachWriter}
 class ForeachSink[T : Encoder](writer: ForeachWriter[T]) extends Sink with 
Serializable {
 
   override def addBatch(batchId: Long, data: DataFrame): Unit = {
-data.as[T].foreachPartition { iter =>
+// TODO: Refine this method when SPARK-16264 is resolved; see comments 
below.
+
+// This logic should've been as simple as:
+// ```
+//   data.as[T].foreachPartition { iter => ... }
+// ```
+//
+// Unfortunately, doing that would just break the incremental planing. The 
reason is,
+// `Dataset.foreachPartition()` would further call `Dataset.rdd()`, but 
`Dataset.rdd()` just
+// does not support `IncrementalExecution`.
+//
+// So as a provisional fix, below we've made a special version of 
`Dataset` with its `rdd()`
+// method supporting incremental planning. But in the long run, we should 
generally make newly
+// created Datasets use `IncrementalExecution` where necessary (which is 
SPARK-16264 tries to
+// resolve).
+
+val datasetWithIncrementalExecution =
+  new Dataset(data.sparkSession, data.logicalPlan, implicitly[Encoder[T]]) 
{
+override lazy val rdd: RDD[T] = {
+  val objectType = exprEnc.deserializer.dataType
+  val deserialized = CatalystSerde.deserialize[T](logicalPlan)
+
+  // was originally: 
sparkSession.sessionState.executePlan(deserialized) ...
+  val incrementalExecution = new IncrementalExecution(
+this.sparkSession,
+deserialized,
+data.queryExecution.asInstanceOf[IncrementalExecution].outputMode,
+
data.queryExecution.asInstanceOf[IncrementalExecution].checkpointLocation,
+
data.queryExecution.asInstanceOf[IncrementalExecution].currentBatchId)
+  incrementalExecution.toRdd.mapPartitions { rows =>
+rows.map(_.get(0, objectType))
+  }.asInstanceOf[RDD[T]]
+}
+  }
+datasetWithIncrementalExecution.foreachPartition { iter =>
   if (writer.open(TaskContext.getPartitionId(), batchId)) {
 var isFailed = fals

spark git commit: [SPARK-16350][SQL] Fix support for incremental planning in writeStream.foreach()

2016-07-07 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 24933355c -> cbfd94eac


[SPARK-16350][SQL] Fix support for incremental planning in writeStream.foreach()

## What changes were proposed in this pull request?

There are cases where the `complete` output mode does not output the updated aggregated value; for details, please refer to [SPARK-16350](https://issues.apache.org/jira/browse/SPARK-16350).

The cause is that, as we do `data.as[T].foreachPartition { iter => ... }` in 
`ForeachSink.addBatch()`, `foreachPartition()` does not support incremental 
planning for now.

This patch makes `foreachPartition()` support incremental planning in `ForeachSink` by making a special version of `Dataset` whose `rdd()` method supports incremental planning.

## How was this patch tested?

Added a unit test which failed before the change

Author: Liwei Lin 

Closes #14030 from lw-lin/fix-foreach-complete.

(cherry picked from commit 0f7175def985a7f1e37198680f893e749612ab76)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cbfd94ea
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cbfd94ea
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cbfd94ea

Branch: refs/heads/branch-2.0
Commit: cbfd94eacf46b61011f1bd8d30f0c134cab37b09
Parents: 2493335
Author: Liwei Lin 
Authored: Thu Jul 7 10:40:42 2016 -0700
Committer: Shixiong Zhu 
Committed: Thu Jul 7 10:40:52 2016 -0700

--
 .../sql/execution/streaming/ForeachSink.scala   | 40 -
 .../streaming/IncrementalExecution.scala|  4 +-
 .../execution/streaming/ForeachSinkSuite.scala  | 86 ++--
 3 files changed, 117 insertions(+), 13 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/cbfd94ea/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala
index 14b9b1c..082664a 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala
@@ -18,7 +18,9 @@
 package org.apache.spark.sql.execution.streaming
 
 import org.apache.spark.TaskContext
-import org.apache.spark.sql.{DataFrame, Encoder, ForeachWriter}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Encoder, ForeachWriter}
+import org.apache.spark.sql.catalyst.plans.logical.CatalystSerde
 
 /**
  * A [[Sink]] that forwards all data into [[ForeachWriter]] according to the 
contract defined by
@@ -30,7 +32,41 @@ import org.apache.spark.sql.{DataFrame, Encoder, 
ForeachWriter}
 class ForeachSink[T : Encoder](writer: ForeachWriter[T]) extends Sink with 
Serializable {
 
   override def addBatch(batchId: Long, data: DataFrame): Unit = {
-data.as[T].foreachPartition { iter =>
+// TODO: Refine this method when SPARK-16264 is resolved; see comments 
below.
+
+// This logic should've been as simple as:
+// ```
+//   data.as[T].foreachPartition { iter => ... }
+// ```
+//
+// Unfortunately, doing that would just break the incremental planing. The 
reason is,
+// `Dataset.foreachPartition()` would further call `Dataset.rdd()`, but 
`Dataset.rdd()` just
+// does not support `IncrementalExecution`.
+//
+// So as a provisional fix, below we've made a special version of 
`Dataset` with its `rdd()`
+// method supporting incremental planning. But in the long run, we should 
generally make newly
+// created Datasets use `IncrementalExecution` where necessary (which is 
SPARK-16264 tries to
+// resolve).
+
+val datasetWithIncrementalExecution =
+  new Dataset(data.sparkSession, data.logicalPlan, implicitly[Encoder[T]]) 
{
+override lazy val rdd: RDD[T] = {
+  val objectType = exprEnc.deserializer.dataType
+  val deserialized = CatalystSerde.deserialize[T](logicalPlan)
+
+  // was originally: 
sparkSession.sessionState.executePlan(deserialized) ...
+  val incrementalExecution = new IncrementalExecution(
+this.sparkSession,
+deserialized,
+data.queryExecution.asInstanceOf[IncrementalExecution].outputMode,
+
data.queryExecution.asInstanceOf[IncrementalExecution].checkpointLocation,
+
data.queryExecution.asInstanceOf[IncrementalExecution].currentBatchId)
+  incrementalExecution.toRdd.mapPartitions { rows =>
+rows.map(_.get(0, objectType))
+  }.asInstanceOf[RDD[T]]
+}
+  }
+datasetWithIncrementalExecution.foreachPartit

spark git commit: [SPARK-16230][CORE] CoarseGrainedExecutorBackend to self kill if there is an exception while creating an Executor

2016-07-15 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 611a8ca58 -> b2f24f945


[SPARK-16230][CORE] CoarseGrainedExecutorBackend to self kill if there is an 
exception while creating an Executor

## What changes were proposed in this pull request?

With the fix from SPARK-13112, I see that `LaunchTask` is always processed after `RegisteredExecutor` is done, so it gets a chance to do all retries to start up an executor. There is still a problem: if `Executor` creation itself fails with some exception, that goes unnoticed and the executor is killed when it tries to process the `LaunchTask`, because `executor` is null: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L88
So the logs do not reveal that there was a problem during `Executor` creation and that this is why the executor was killed.

This PR explicitly catches exceptions during `Executor` creation, logs a proper message, and then exits the JVM. It also changes the `exitExecutor` method to accept a `reason`, so that backends can use that reason, for example to log to a DB and get an aggregate of such exits at a cluster level.

## How was this patch tested?

I am relying on existing tests

Author: Tejas Patil 

Closes #14202 from tejasapatil/exit_executor_failure.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b2f24f94
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b2f24f94
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b2f24f94

Branch: refs/heads/master
Commit: b2f24f94591082d3ff82bd3db1760b14603b38aa
Parents: 611a8ca
Author: Tejas Patil 
Authored: Fri Jul 15 14:27:16 2016 -0700
Committer: Shixiong Zhu 
Committed: Fri Jul 15 14:27:16 2016 -0700

--
 .../executor/CoarseGrainedExecutorBackend.scala | 32 
 1 file changed, 20 insertions(+), 12 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b2f24f94/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
 
b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
index ccc6c36..e30839c 100644
--- 
a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
+++ 
b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
@@ -23,6 +23,7 @@ import java.util.concurrent.atomic.AtomicBoolean
 
 import scala.collection.mutable
 import scala.util.{Failure, Success}
+import scala.util.control.NonFatal
 
 import org.apache.spark._
 import org.apache.spark.TaskState.TaskState
@@ -64,8 +65,7 @@ private[spark] class CoarseGrainedExecutorBackend(
   case Success(msg) =>
 // Always receive `true`. Just ignore it
   case Failure(e) =>
-logError(s"Cannot register with driver: $driverUrl", e)
-exitExecutor(1)
+exitExecutor(1, s"Cannot register with driver: $driverUrl", e)
 }(ThreadUtils.sameThread)
   }
 
@@ -78,16 +78,19 @@ private[spark] class CoarseGrainedExecutorBackend(
   override def receive: PartialFunction[Any, Unit] = {
 case RegisteredExecutor =>
   logInfo("Successfully registered with driver")
-  executor = new Executor(executorId, hostname, env, userClassPath, 
isLocal = false)
+  try {
+executor = new Executor(executorId, hostname, env, userClassPath, 
isLocal = false)
+  } catch {
+case NonFatal(e) =>
+  exitExecutor(1, "Unable to create executor due to " + e.getMessage, 
e)
+  }
 
 case RegisterExecutorFailed(message) =>
-  logError("Slave registration failed: " + message)
-  exitExecutor(1)
+  exitExecutor(1, "Slave registration failed: " + message)
 
 case LaunchTask(data) =>
   if (executor == null) {
-logError("Received LaunchTask command but executor was null")
-exitExecutor(1)
+exitExecutor(1, "Received LaunchTask command but executor was null")
   } else {
 val taskDesc = ser.deserialize[TaskDescription](data.value)
 logInfo("Got assigned task " + taskDesc.taskId)
@@ -97,8 +100,7 @@ private[spark] class CoarseGrainedExecutorBackend(
 
 case KillTask(taskId, _, interruptThread) =>
   if (executor == null) {
-logError("Received KillTask command but executor was null")
-exitExecutor(1)
+exitExecutor(1, "Received KillTask command but executor was null")
   } else {
 executor.killTask(taskId, interruptThread)
   }
@@ -127,8 +129,7 @@ private[spark] class CoarseGrainedExecutorBackend(
 if (stopping.get()) {
   logInfo(s"Driver from $remoteAddress disconnected during shutdown")
   

spark git commit: [SPARK-16230][CORE] CoarseGrainedExecutorBackend to self kill if there is an exception while creating an Executor

2016-07-15 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 e833c906f -> 34ac45a34


[SPARK-16230][CORE] CoarseGrainedExecutorBackend to self kill if there is an 
exception while creating an Executor

## What changes were proposed in this pull request?

With the fix from SPARK-13112, I see that `LaunchTask` is always processed after `RegisteredExecutor` is done, so it gets a chance to do all retries to start up an executor. There is still a problem: if `Executor` creation itself fails with some exception, that goes unnoticed and the executor is killed when it tries to process the `LaunchTask`, because `executor` is null: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L88
So the logs do not reveal that there was a problem during `Executor` creation and that this is why the executor was killed.

This PR explicitly catches exceptions during `Executor` creation, logs a proper message, and then exits the JVM. It also changes the `exitExecutor` method to accept a `reason`, so that backends can use that reason, for example to log to a DB and get an aggregate of such exits at a cluster level.

## How was this patch tested?

I am relying on existing tests

Author: Tejas Patil 

Closes #14202 from tejasapatil/exit_executor_failure.

(cherry picked from commit b2f24f94591082d3ff82bd3db1760b14603b38aa)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/34ac45a3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/34ac45a3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/34ac45a3

Branch: refs/heads/branch-2.0
Commit: 34ac45a34d5673112c84ed464a7a23c68c7bd8fe
Parents: e833c90
Author: Tejas Patil 
Authored: Fri Jul 15 14:27:16 2016 -0700
Committer: Shixiong Zhu 
Committed: Fri Jul 15 14:27:29 2016 -0700

--
 .../executor/CoarseGrainedExecutorBackend.scala | 32 
 1 file changed, 20 insertions(+), 12 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/34ac45a3/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
 
b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
index ccc6c36..e30839c 100644
--- 
a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
+++ 
b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
@@ -23,6 +23,7 @@ import java.util.concurrent.atomic.AtomicBoolean
 
 import scala.collection.mutable
 import scala.util.{Failure, Success}
+import scala.util.control.NonFatal
 
 import org.apache.spark._
 import org.apache.spark.TaskState.TaskState
@@ -64,8 +65,7 @@ private[spark] class CoarseGrainedExecutorBackend(
   case Success(msg) =>
 // Always receive `true`. Just ignore it
   case Failure(e) =>
-logError(s"Cannot register with driver: $driverUrl", e)
-exitExecutor(1)
+exitExecutor(1, s"Cannot register with driver: $driverUrl", e)
 }(ThreadUtils.sameThread)
   }
 
@@ -78,16 +78,19 @@ private[spark] class CoarseGrainedExecutorBackend(
   override def receive: PartialFunction[Any, Unit] = {
 case RegisteredExecutor =>
   logInfo("Successfully registered with driver")
-  executor = new Executor(executorId, hostname, env, userClassPath, 
isLocal = false)
+  try {
+executor = new Executor(executorId, hostname, env, userClassPath, 
isLocal = false)
+  } catch {
+case NonFatal(e) =>
+  exitExecutor(1, "Unable to create executor due to " + e.getMessage, 
e)
+  }
 
 case RegisterExecutorFailed(message) =>
-  logError("Slave registration failed: " + message)
-  exitExecutor(1)
+  exitExecutor(1, "Slave registration failed: " + message)
 
 case LaunchTask(data) =>
   if (executor == null) {
-logError("Received LaunchTask command but executor was null")
-exitExecutor(1)
+exitExecutor(1, "Received LaunchTask command but executor was null")
   } else {
 val taskDesc = ser.deserialize[TaskDescription](data.value)
 logInfo("Got assigned task " + taskDesc.taskId)
@@ -97,8 +100,7 @@ private[spark] class CoarseGrainedExecutorBackend(
 
 case KillTask(taskId, _, interruptThread) =>
   if (executor == null) {
-logError("Received KillTask command but executor was null")
-exitExecutor(1)
+exitExecutor(1, "Received KillTask command but executor was null")
   } else {
 executor.killTask(taskId, interruptThread)
   }
@@ -127,8 +129,7 @@ private[spark] class CoarseGrainedExecutorBackend

spark git commit: [SPARK-16715][TESTS] Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash"

2016-07-25 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master f5ea7fe53 -> 12f490b5c


[SPARK-16715][TESTS] Fix a potential ExprId conflict for 
SubexpressionEliminationSuite."Semantic equals and hash"

## What changes were proposed in this pull request?

SubexpressionEliminationSuite."Semantic equals and hash" assumes the default 
AttributeReference's exprId won't be "ExprId(1)". However, that depends on when 
this test runs. It may happen to use "ExprId(1)".

This PR detects the conflict and makes sure we create a different ExprId when 
the conflict happens.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu 

Closes #14350 from zsxwing/SPARK-16715.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/12f490b5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/12f490b5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/12f490b5

Branch: refs/heads/master
Commit: 12f490b5c85cdee26d47eb70ad1a1edd00504f21
Parents: f5ea7fe
Author: Shixiong Zhu 
Authored: Mon Jul 25 16:08:29 2016 -0700
Committer: Shixiong Zhu 
Committed: Mon Jul 25 16:08:29 2016 -0700

--
 .../catalyst/expressions/SubexpressionEliminationSuite.scala   | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/12f490b5/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
index 90e97d7..1e39b24 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
@@ -21,8 +21,12 @@ import org.apache.spark.sql.types.IntegerType
 
 class SubexpressionEliminationSuite extends SparkFunSuite {
   test("Semantic equals and hash") {
-val id = ExprId(1)
 val a: AttributeReference = AttributeReference("name", IntegerType)()
+val id = {
+  // Make sure we use a "ExprId" different from "a.exprId"
+  val _id = ExprId(1)
+  if (a.exprId == _id) ExprId(2) else _id
+}
 val b1 = a.withName("name2").withExprId(id)
 val b2 = a.withExprId(id)
 val b3 = a.withQualifier(Some("qualifierName"))





spark git commit: [SPARK-16715][TESTS] Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash"

2016-07-25 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 1b4f7cf13 -> 41e72f659


[SPARK-16715][TESTS] Fix a potential ExprId conflict for 
SubexpressionEliminationSuite."Semantic equals and hash"

## What changes were proposed in this pull request?

SubexpressionEliminationSuite."Semantic equals and hash" assumes the default 
AttributeReference's exprId won't be "ExprId(1)". However, that depends on when 
this test runs. It may happen to use "ExprId(1)".

This PR detects the conflict and makes sure we create a different ExprId when 
the conflict happens.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu 

Closes #14350 from zsxwing/SPARK-16715.

(cherry picked from commit 12f490b5c85cdee26d47eb70ad1a1edd00504f21)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/41e72f65
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/41e72f65
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/41e72f65

Branch: refs/heads/branch-2.0
Commit: 41e72f65929c345aa21ebd4e00dadfbfb5acfdf3
Parents: 1b4f7cf
Author: Shixiong Zhu 
Authored: Mon Jul 25 16:08:29 2016 -0700
Committer: Shixiong Zhu 
Committed: Mon Jul 25 16:08:36 2016 -0700

--
 .../catalyst/expressions/SubexpressionEliminationSuite.scala   | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/41e72f65/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
index 90e97d7..1e39b24 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
@@ -21,8 +21,12 @@ import org.apache.spark.sql.types.IntegerType
 
 class SubexpressionEliminationSuite extends SparkFunSuite {
   test("Semantic equals and hash") {
-val id = ExprId(1)
 val a: AttributeReference = AttributeReference("name", IntegerType)()
+val id = {
+  // Make sure we use a "ExprId" different from "a.exprId"
+  val _id = ExprId(1)
+  if (a.exprId == _id) ExprId(2) else _id
+}
 val b1 = a.withName("name2").withExprId(id)
 val b2 = a.withExprId(id)
 val b3 = a.withQualifier(Some("qualifierName"))





spark git commit: [SPARK-15590][WEBUI] Paginate Job Table in Jobs tab

2016-07-25 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master c979c8bba -> db36e1e75


[SPARK-15590][WEBUI] Paginate Job Table in Jobs tab

## What changes were proposed in this pull request?

This patch adds pagination support for the Job Tables in the Jobs tab. 
Pagination is provided for all of the three Job Tables (active, completed, and 
failed). Interactions (jumping, sorting, and setting page size) for paged 
tables are also included.

The diff does not track some lines against the original ones. The function `makeRow` of the original `AllJobsPage.scala` is reused; it is split between the beginning of the function `jobRow` (L427-439) and the function `row` (L594-618) in the new `AllJobsPage.scala`.

## How was this patch tested?

Tested manually by checking the Web UI after completing and failing hundreds of jobs.
Generate completed jobs by:
```scala
val d = sc.parallelize(Array(1,2,3,4,5))
for(i <- 1 to 255){ var b = d.collect() }
```
Generate failed jobs by calling the following code multiple times:
```scala
var b = d.map(_/0).collect()
```
Interactions like jumping, sorting, and setting page size are all tested.

This shows the pagination for completed jobs:
![paginate success 
jobs](https://cloud.githubusercontent.com/assets/5558370/15986498/efa12ef6-303b-11e6-8b1d-c3382aeb9ad0.png)

This shows the sorting works in job tables:
![sorting](https://cloud.githubusercontent.com/assets/5558370/15986539/98c8a81a-303c-11e6-86f2-8d2bc7924ee9.png)

This shows the pagination for failed jobs and the effect of jumping and setting 
page size:
![paginate failed 
jobs](https://cloud.githubusercontent.com/assets/5558370/15986556/d8c1323e-303c-11e6-8e4b-7bdb030ea42b.png)

Author: Tao Lin 

Closes #13620 from nblintao/dev.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/db36e1e7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/db36e1e7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/db36e1e7

Branch: refs/heads/master
Commit: db36e1e75d69d63b76312e85ae3a6c95cebbe65e
Parents: c979c8b
Author: Tao Lin 
Authored: Mon Jul 25 17:35:50 2016 -0700
Committer: Shixiong Zhu 
Committed: Mon Jul 25 17:35:50 2016 -0700

--
 .../org/apache/spark/ui/jobs/AllJobsPage.scala  | 369 ---
 .../org/apache/spark/ui/UISeleniumSuite.scala   |   5 +-
 2 files changed, 312 insertions(+), 62 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/db36e1e7/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala
--
diff --git a/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala 
b/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala
index 035d706..e5363ce 100644
--- a/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala
+++ b/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala
@@ -17,17 +17,21 @@
 
 package org.apache.spark.ui.jobs
 
+import java.net.URLEncoder
 import java.util.Date
 import javax.servlet.http.HttpServletRequest
 
+import scala.collection.JavaConverters._
 import scala.collection.mutable.{HashMap, ListBuffer}
 import scala.xml._
 
 import org.apache.commons.lang3.StringEscapeUtils
 
 import org.apache.spark.JobExecutionStatus
-import org.apache.spark.ui.{ToolTips, UIUtils, WebUIPage}
-import org.apache.spark.ui.jobs.UIData.{ExecutorUIData, JobUIData}
+import org.apache.spark.scheduler.StageInfo
+import org.apache.spark.ui._
+import org.apache.spark.ui.jobs.UIData.{ExecutorUIData, JobUIData, StageUIData}
+import org.apache.spark.util.Utils
 
 /** Page showing list of all ongoing and recently finished jobs */
 private[ui] class AllJobsPage(parent: JobsTab) extends WebUIPage("") {
@@ -210,64 +214,69 @@ private[ui] class AllJobsPage(parent: JobsTab) extends 
WebUIPage("") {
 
   }
 
-  private def jobsTable(jobs: Seq[JobUIData]): Seq[Node] = {
+  private def jobsTable(
+  request: HttpServletRequest,
+  jobTag: String,
+  jobs: Seq[JobUIData]): Seq[Node] = {
+val allParameters = request.getParameterMap.asScala.toMap
+val parameterOtherTable = allParameters.filterNot(_._1.startsWith(jobTag))
+  .map(para => para._1 + "=" + para._2(0))
+
 val someJobHasJobGroup = jobs.exists(_.jobGroup.isDefined)
+val jobIdTitle = if (someJobHasJobGroup) "Job Id (Job Group)" else "Job Id"
 
-val columns: Seq[Node] = {
-  {if (someJobHasJobGroup) "Job Id (Job Group)" else "Job Id"}
-  Description
-  Submitted
-  Duration
-  Stages: Succeeded/Total
-  Tasks (for all stages): Succeeded/Total
-}
+val parameterJobPage = request.getParameter(jobTag + ".page")
+val parameterJobSortColumn = request.getParameter(jobTag + ".sort")
+val parameterJobSortDesc = request.getParameter(jobTag + ".desc")
+val pa

spark git commit: [SPARK-15869][STREAMING] Fix a potential NPE in StreamingJobProgressListener.getBatchUIData

2016-08-01 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master ab1e761f9 -> 03d46aafe


[SPARK-15869][STREAMING] Fix a potential NPE in 
StreamingJobProgressListener.getBatchUIData

## What changes were proposed in this pull request?

Moved `asScala` into a `map` call to avoid an NPE.

## How was this patch tested?

Existing unit tests.

Author: Shixiong Zhu 

Closes #14443 from zsxwing/SPARK-15869.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/03d46aaf
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/03d46aaf
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/03d46aaf

Branch: refs/heads/master
Commit: 03d46aafe561b03e25f4e25cf01e631c18dd827c
Parents: ab1e761
Author: Shixiong Zhu 
Authored: Mon Aug 1 14:41:22 2016 -0700
Committer: Shixiong Zhu 
Committed: Mon Aug 1 14:41:22 2016 -0700

--
 .../apache/spark/streaming/ui/StreamingJobProgressListener.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/03d46aaf/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
--
diff --git 
a/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
 
b/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
index c086df4..61f852a 100644
--- 
a/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
+++ 
b/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
@@ -259,7 +259,7 @@ private[streaming] class StreamingJobProgressListener(ssc: 
StreamingContext)
   // We use an Iterable rather than explicitly converting to a seq so that 
updates
   // will propagate
   val outputOpIdToSparkJobIds: Iterable[OutputOpIdAndSparkJobId] =
-Option(batchTimeToOutputOpIdSparkJobIdPair.get(batchTime).asScala)
+
Option(batchTimeToOutputOpIdSparkJobIdPair.get(batchTime)).map(_.asScala)
   .getOrElse(Seq.empty)
   _batchUIData.outputOpIdSparkJobIdPairs = outputOpIdToSparkJobIds
 }
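
For illustration, here is a minimal self-contained sketch (not Spark's listener code; the map, key, and object names are made up) of why the `Option` has to wrap the raw map lookup before any conversion is applied:

```scala
import java.util.{HashMap => JHashMap, List => JList}

import scala.collection.JavaConverters._

object NullSafeLookup {
  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for batchTimeToOutputOpIdSparkJobIdPair.
    val pairs = new JHashMap[Long, JList[Integer]]()

    // get() returns null for a missing key. Calling .asScala on that null
    // before wrapping it in Option defeats the null check: depending on the
    // converter this either throws immediately or yields a wrapper around
    // null that only fails when it is first used. The safe ordering is to
    // wrap first and convert inside map(), as the patch does.
    val jobIds = Option(pairs.get(1000L)).map(_.asScala).getOrElse(Seq.empty)

    println(jobIds.isEmpty) // true, and no NPE for the missing batch time
  }
}
```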





spark git commit: [SPARK-15869][STREAMING] Fix a potential NPE in StreamingJobProgressListener.getBatchUIData

2016-08-01 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 4e73cb8eb -> 1813bbd9b


[SPARK-15869][STREAMING] Fix a potential NPE in 
StreamingJobProgressListener.getBatchUIData

## What changes were proposed in this pull request?

Moved the `asScala` conversion inside a `map` to avoid an NPE.

## How was this patch tested?

Existing unit tests.

Author: Shixiong Zhu 

Closes #14443 from zsxwing/SPARK-15869.

(cherry picked from commit 03d46aafe561b03e25f4e25cf01e631c18dd827c)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1813bbd9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1813bbd9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1813bbd9

Branch: refs/heads/branch-2.0
Commit: 1813bbd9bf7cb9afd29e1385f0dc52e8fcc4f132
Parents: 4e73cb8
Author: Shixiong Zhu 
Authored: Mon Aug 1 14:41:22 2016 -0700
Committer: Shixiong Zhu 
Committed: Mon Aug 1 14:41:34 2016 -0700

--
 .../apache/spark/streaming/ui/StreamingJobProgressListener.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1813bbd9/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
--
diff --git 
a/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
 
b/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
index c086df4..61f852a 100644
--- 
a/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
+++ 
b/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
@@ -259,7 +259,7 @@ private[streaming] class StreamingJobProgressListener(ssc: 
StreamingContext)
   // We use an Iterable rather than explicitly converting to a seq so that 
updates
   // will propagate
   val outputOpIdToSparkJobIds: Iterable[OutputOpIdAndSparkJobId] =
-Option(batchTimeToOutputOpIdSparkJobIdPair.get(batchTime).asScala)
+
Option(batchTimeToOutputOpIdSparkJobIdPair.get(batchTime)).map(_.asScala)
   .getOrElse(Seq.empty)
   _batchUIData.outputOpIdSparkJobIdPairs = outputOpIdToSparkJobIds
 }





spark git commit: [SPARK-17038][STREAMING] fix metrics retrieval source of 'lastReceivedBatch'

2016-08-17 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master d60af8f6a -> e6bef7d52


[SPARK-17038][STREAMING] fix metrics retrieval source of 'lastReceivedBatch'

https://issues.apache.org/jira/browse/SPARK-17038

## What changes were proposed in this pull request?

StreamingSource's lastReceivedBatch_submissionTime, 
lastReceivedBatch_processingStartTime, and lastReceivedBatch_processingEndTime 
all use data from lastCompletedBatch instead of lastReceivedBatch.

In particular, this makes it impossible to match lastReceivedBatch_records with 
a batchID/submission time.

This is apparent when looking at StreamingSource.scala, lines 89-94.
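
As an illustration only (this is not Spark's StreamingSource or listener; the class and method names below are invented), a minimal sketch of why a gauge wired to the completed batch reports stale values for the batch that was just received:

```scala
// A batch becomes "received" as soon as it is submitted, but only becomes
// "completed" once processing finishes, so the two references can point at
// different batches at any given moment.
final case class BatchInfo(batchTime: Long, submissionTime: Long)

final class BatchTracker {
  private var lastReceived: Option[BatchInfo] = None
  private var lastCompleted: Option[BatchInfo] = None

  def onBatchSubmitted(b: BatchInfo): Unit = { lastReceived = Some(b) }
  def onBatchCompleted(b: BatchInfo): Unit = { lastCompleted = Some(b) }

  // A "lastReceivedBatch_submissionTime"-style gauge must read lastReceived,
  // with -1 as the fallback while nothing has been received yet.
  def lastReceivedSubmissionTime: Long = lastReceived.map(_.submissionTime).getOrElse(-1L)
  def lastCompletedSubmissionTime: Long = lastCompleted.map(_.submissionTime).getOrElse(-1L)
}

object BatchTrackerDemo {
  def main(args: Array[String]): Unit = {
    val tracker = new BatchTracker
    tracker.onBatchSubmitted(BatchInfo(batchTime = 1000L, submissionTime = 1000L))
    // The received-batch value exists as soon as the batch is submitted, while
    // the completed-batch value does not; reading the latter for the
    // lastReceivedBatch_* metrics is what made it impossible to match
    // lastReceivedBatch_records with a submission time.
    println(tracker.lastReceivedSubmissionTime)  // 1000
    println(tracker.lastCompletedSubmissionTime) // -1
  }
}
```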

## How was this patch tested?

Manually running unit tests on local laptop

Author: Xin Ren 

Closes #14681 from keypointt/SPARK-17038.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e6bef7d5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e6bef7d5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e6bef7d5

Branch: refs/heads/master
Commit: e6bef7d52f0e19ec771fb0f3e96c7ddbd1a6a19b
Parents: d60af8f
Author: Xin Ren 
Authored: Wed Aug 17 16:31:42 2016 -0700
Committer: Shixiong Zhu 
Committed: Wed Aug 17 16:31:42 2016 -0700

--
 .../scala/org/apache/spark/streaming/StreamingSource.scala | 6 +++---
 .../spark/streaming/ui/StreamingJobProgressListenerSuite.scala | 3 +++
 2 files changed, 6 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e6bef7d5/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala
--
diff --git 
a/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala 
b/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala
index 9697437..0b306a2 100644
--- a/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala
@@ -87,11 +87,11 @@ private[streaming] class StreamingSource(ssc: 
StreamingContext) extends Source {
   // Gauge for last received batch, useful for monitoring the streaming job's 
running status,
   // displayed data -1 for any abnormal condition.
   registerGaugeWithOption("lastReceivedBatch_submissionTime",
-_.lastCompletedBatch.map(_.submissionTime), -1L)
+_.lastReceivedBatch.map(_.submissionTime), -1L)
   registerGaugeWithOption("lastReceivedBatch_processingStartTime",
-_.lastCompletedBatch.flatMap(_.processingStartTime), -1L)
+_.lastReceivedBatch.flatMap(_.processingStartTime), -1L)
   registerGaugeWithOption("lastReceivedBatch_processingEndTime",
-_.lastCompletedBatch.flatMap(_.processingEndTime), -1L)
+_.lastReceivedBatch.flatMap(_.processingEndTime), -1L)
 
   // Gauge for last received batch records.
   registerGauge("lastReceivedBatch_records", 
_.lastReceivedBatchRecords.values.sum, 0L)

http://git-wip-us.apache.org/repos/asf/spark/blob/e6bef7d5/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
--
diff --git 
a/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
 
b/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
index 26b757c..46ab3ac 100644
--- 
a/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
+++ 
b/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
@@ -68,6 +68,7 @@ class StreamingJobProgressListenerSuite extends TestSuiteBase 
with Matchers {
 listener.waitingBatches should be (List(BatchUIData(batchInfoSubmitted)))
 listener.runningBatches should be (Nil)
 listener.retainedCompletedBatches should be (Nil)
+listener.lastReceivedBatch should be 
(Some(BatchUIData(batchInfoSubmitted)))
 listener.lastCompletedBatch should be (None)
 listener.numUnprocessedBatches should be (1)
 listener.numTotalCompletedBatches should be (0)
@@ -81,6 +82,7 @@ class StreamingJobProgressListenerSuite extends TestSuiteBase 
with Matchers {
 listener.waitingBatches should be (Nil)
 listener.runningBatches should be (List(BatchUIData(batchInfoStarted)))
 listener.retainedCompletedBatches should be (Nil)
+listener.lastReceivedBatch should be (Some(BatchUIData(batchInfoStarted)))
 listener.lastCompletedBatch should be (None)
 listener.numUnprocessedBatches should be (1)
 listener.numTotalCompletedBatches should be (0)
@@ -123,6 +125,7 @@ class StreamingJobProgressListenerSuite extends 
TestSuiteBase with Matchers {
 listener.waitingBatches should be (Nil)
 listener.runningBatches should be (Nil)
 listener.retainedComple

spark git commit: [SPARK-17038][STREAMING] fix metrics retrieval source of 'lastReceivedBatch'

2016-08-17 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-1.6 60de30faf -> 412b0e896


[SPARK-17038][STREAMING] fix metrics retrieval source of 'lastReceivedBatch'

https://issues.apache.org/jira/browse/SPARK-17038

## What changes were proposed in this pull request?

StreamingSource's lastReceivedBatch_submissionTime, 
lastReceivedBatch_processingStartTime, and lastReceivedBatch_processingEndTime 
all use data from lastCompletedBatch instead of lastReceivedBatch.

In particular, this makes it impossible to match lastReceivedBatch_records with 
a batchID/submission time.

This is apparent when looking at StreamingSource.scala, lines 89-94.

## How was this patch tested?

Manually running unit tests on local laptop

Author: Xin Ren 

Closes #14681 from keypointt/SPARK-17038.

(cherry picked from commit e6bef7d52f0e19ec771fb0f3e96c7ddbd1a6a19b)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/412b0e89
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/412b0e89
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/412b0e89

Branch: refs/heads/branch-1.6
Commit: 412b0e8969215411b97efd3d0984dc6cac5d31e0
Parents: 60de30f
Author: Xin Ren 
Authored: Wed Aug 17 16:31:42 2016 -0700
Committer: Shixiong Zhu 
Committed: Wed Aug 17 16:32:01 2016 -0700

--
 .../scala/org/apache/spark/streaming/StreamingSource.scala | 6 +++---
 .../spark/streaming/ui/StreamingJobProgressListenerSuite.scala | 3 +++
 2 files changed, 6 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/412b0e89/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala
--
diff --git 
a/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala 
b/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala
index 9697437..0b306a2 100644
--- a/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala
@@ -87,11 +87,11 @@ private[streaming] class StreamingSource(ssc: 
StreamingContext) extends Source {
   // Gauge for last received batch, useful for monitoring the streaming job's 
running status,
   // displayed data -1 for any abnormal condition.
   registerGaugeWithOption("lastReceivedBatch_submissionTime",
-_.lastCompletedBatch.map(_.submissionTime), -1L)
+_.lastReceivedBatch.map(_.submissionTime), -1L)
   registerGaugeWithOption("lastReceivedBatch_processingStartTime",
-_.lastCompletedBatch.flatMap(_.processingStartTime), -1L)
+_.lastReceivedBatch.flatMap(_.processingStartTime), -1L)
   registerGaugeWithOption("lastReceivedBatch_processingEndTime",
-_.lastCompletedBatch.flatMap(_.processingEndTime), -1L)
+_.lastReceivedBatch.flatMap(_.processingEndTime), -1L)
 
   // Gauge for last received batch records.
   registerGauge("lastReceivedBatch_records", 
_.lastReceivedBatchRecords.values.sum, 0L)

http://git-wip-us.apache.org/repos/asf/spark/blob/412b0e89/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
--
diff --git 
a/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
 
b/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
index 34cd743..73c8c56 100644
--- 
a/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
+++ 
b/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
@@ -68,6 +68,7 @@ class StreamingJobProgressListenerSuite extends TestSuiteBase 
with Matchers {
 listener.waitingBatches should be (List(BatchUIData(batchInfoSubmitted)))
 listener.runningBatches should be (Nil)
 listener.retainedCompletedBatches should be (Nil)
+listener.lastReceivedBatch should be 
(Some(BatchUIData(batchInfoSubmitted)))
 listener.lastCompletedBatch should be (None)
 listener.numUnprocessedBatches should be (1)
 listener.numTotalCompletedBatches should be (0)
@@ -81,6 +82,7 @@ class StreamingJobProgressListenerSuite extends TestSuiteBase 
with Matchers {
 listener.waitingBatches should be (Nil)
 listener.runningBatches should be (List(BatchUIData(batchInfoStarted)))
 listener.retainedCompletedBatches should be (Nil)
+listener.lastReceivedBatch should be (Some(BatchUIData(batchInfoStarted)))
 listener.lastCompletedBatch should be (None)
 listener.numUnprocessedBatches should be (1)
 listener.numTotalCompletedBatches should be (0)
@@ -123,6 +125,7 @@ class StreamingJobProgressListenerSuite extends 
TestSuiteBase with Matchers {
 listen

spark git commit: [SPARK-17038][STREAMING] fix metrics retrieval source of 'lastReceivedBatch'

2016-08-17 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 9406f82db -> 585d1d95c


[SPARK-17038][STREAMING] fix metrics retrieval source of 'lastReceivedBatch'

https://issues.apache.org/jira/browse/SPARK-17038

## What changes were proposed in this pull request?

StreamingSource's lastReceivedBatch_submissionTime, 
lastReceivedBatch_processingStartTime, and lastReceivedBatch_processingEndTime 
all use data from lastCompletedBatch instead of lastReceivedBatch.

In particular, this makes it impossible to match lastReceivedBatch_records with 
a batchID/submission time.

This is apparent when looking at StreamingSource.scala, lines 89-94.

## How was this patch tested?

Manually running unit tests on local laptop

Author: Xin Ren 

Closes #14681 from keypointt/SPARK-17038.

(cherry picked from commit e6bef7d52f0e19ec771fb0f3e96c7ddbd1a6a19b)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/585d1d95
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/585d1d95
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/585d1d95

Branch: refs/heads/branch-2.0
Commit: 585d1d95cb1c4419c716d3b3f595834927e0c175
Parents: 9406f82
Author: Xin Ren 
Authored: Wed Aug 17 16:31:42 2016 -0700
Committer: Shixiong Zhu 
Committed: Wed Aug 17 16:31:50 2016 -0700

--
 .../scala/org/apache/spark/streaming/StreamingSource.scala | 6 +++---
 .../spark/streaming/ui/StreamingJobProgressListenerSuite.scala | 3 +++
 2 files changed, 6 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/585d1d95/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala
--
diff --git 
a/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala 
b/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala
index 9697437..0b306a2 100644
--- a/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala
@@ -87,11 +87,11 @@ private[streaming] class StreamingSource(ssc: 
StreamingContext) extends Source {
   // Gauge for last received batch, useful for monitoring the streaming job's 
running status,
   // displayed data -1 for any abnormal condition.
   registerGaugeWithOption("lastReceivedBatch_submissionTime",
-_.lastCompletedBatch.map(_.submissionTime), -1L)
+_.lastReceivedBatch.map(_.submissionTime), -1L)
   registerGaugeWithOption("lastReceivedBatch_processingStartTime",
-_.lastCompletedBatch.flatMap(_.processingStartTime), -1L)
+_.lastReceivedBatch.flatMap(_.processingStartTime), -1L)
   registerGaugeWithOption("lastReceivedBatch_processingEndTime",
-_.lastCompletedBatch.flatMap(_.processingEndTime), -1L)
+_.lastReceivedBatch.flatMap(_.processingEndTime), -1L)
 
   // Gauge for last received batch records.
   registerGauge("lastReceivedBatch_records", 
_.lastReceivedBatchRecords.values.sum, 0L)

http://git-wip-us.apache.org/repos/asf/spark/blob/585d1d95/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
--
diff --git 
a/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
 
b/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
index 26b757c..46ab3ac 100644
--- 
a/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
+++ 
b/streaming/src/test/scala/org/apache/spark/streaming/ui/StreamingJobProgressListenerSuite.scala
@@ -68,6 +68,7 @@ class StreamingJobProgressListenerSuite extends TestSuiteBase 
with Matchers {
 listener.waitingBatches should be (List(BatchUIData(batchInfoSubmitted)))
 listener.runningBatches should be (Nil)
 listener.retainedCompletedBatches should be (Nil)
+listener.lastReceivedBatch should be 
(Some(BatchUIData(batchInfoSubmitted)))
 listener.lastCompletedBatch should be (None)
 listener.numUnprocessedBatches should be (1)
 listener.numTotalCompletedBatches should be (0)
@@ -81,6 +82,7 @@ class StreamingJobProgressListenerSuite extends TestSuiteBase 
with Matchers {
 listener.waitingBatches should be (Nil)
 listener.runningBatches should be (List(BatchUIData(batchInfoStarted)))
 listener.retainedCompletedBatches should be (Nil)
+listener.lastReceivedBatch should be (Some(BatchUIData(batchInfoStarted)))
 listener.lastCompletedBatch should be (None)
 listener.numUnprocessedBatches should be (1)
 listener.numTotalCompletedBatches should be (0)
@@ -123,6 +125,7 @@ class StreamingJobProgressListenerSuite extends 
TestSuiteBase with Matchers {
 listen

spark git commit: [SPARK-17231][CORE] Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-25 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master d2ae6399e -> f20931071


[SPARK-17231][CORE] Avoid building debug or trace log messages unless the 
respective log level is enabled

(This PR addresses https://issues.apache.org/jira/browse/SPARK-17231)

## What changes were proposed in this pull request?

While debugging the performance of a large GraphX connected components 
computation, we found several places in the `network-common` and 
`network-shuffle` code bases where trace or debug log messages are constructed 
even if the respective log level is disabled. According to YourKit, these 
constructions were creating substantial churn in the eden region. Refactoring 
the respective code to avoid these unnecessary constructions except where 
necessary led to a modest but measurable reduction in our job's task time, GC 
time and the ratio thereof.
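
A minimal standalone sketch of the guard pattern this PR applies, using plain SLF4J rather than Spark's network module (the logger and message here are illustrative):

```scala
import org.slf4j.LoggerFactory

object GuardedLogging {
  private val logger = LoggerFactory.getLogger(getClass)

  // Stands in for work such as formatting a remote address or dumping state.
  private def expensiveDescription(): String = (1 to 1000).mkString(",")

  def main(args: Array[String]): Unit = {
    // Without the guard, expensiveDescription() runs even when DEBUG is off,
    // producing short-lived garbage that the logger then discards.
    if (logger.isDebugEnabled) {
      logger.debug("Request details: {}", expensiveDescription())
    }
  }
}
```

Note that the `{}` placeholder already defers the string concatenation, but not the construction of the argument itself; the explicit `isDebugEnabled`/`isTraceEnabled` check is what avoids that allocation.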

## How was this patch tested?

We computed the connected components of a graph with about 2.6 billion vertices 
and 1.7 billion edges four times. We used four different EC2 clusters each with 
8 r3.8xl worker nodes. Two test runs used Spark master. Two used Spark master + 
this PR. The results from the first test run, master and master+PR:
![master](https://cloud.githubusercontent.com/assets/833693/17951634/7471cbca-6a18-11e6-9c26-78afe9319685.jpg)
![logging_perf_improvements](https://cloud.githubusercontent.com/assets/833693/17951632/7467844e-6a18-11e6-9a0e-053dc7650413.jpg)

The results from the second test run, master and master+PR:
![master 
2](https://cloud.githubusercontent.com/assets/833693/17951633/746dd6aa-6a18-11e6-8e27-606680b3f105.jpg)
![logging_perf_improvements 
2](https://cloud.githubusercontent.com/assets/833693/17951631/74488710-6a18-11e6-8a32-08692f373386.jpg)

Though modest, I believe these results are significant.

Author: Michael Allman 

Closes #14798 from mallman/spark-17231-logging_perf_improvements.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f2093107
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f2093107
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f2093107

Branch: refs/heads/master
Commit: f2093107196b9af62908ecf15bac043f3b1e64c4
Parents: d2ae639
Author: Michael Allman 
Authored: Thu Aug 25 11:57:38 2016 -0700
Committer: Shixiong Zhu 
Committed: Thu Aug 25 11:57:38 2016 -0700

--
 .../spark/network/client/TransportClient.java   | 39 
 .../network/client/TransportClientFactory.java  |  2 +-
 .../client/TransportResponseHandler.java| 15 
 .../spark/network/protocol/MessageDecoder.java  |  2 +-
 .../network/server/TransportChannelHandler.java |  6 +--
 .../network/server/TransportRequestHandler.java | 18 -
 .../spark/network/server/TransportServer.java   |  2 +-
 .../shuffle/ExternalShuffleBlockHandler.java| 14 ---
 .../shuffle/ExternalShuffleBlockResolver.java   |  2 +-
 9 files changed, 55 insertions(+), 45 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f2093107/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
--
diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
 
b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
index 64a8317..a67683b 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
@@ -43,7 +43,7 @@ import org.apache.spark.network.protocol.OneWayMessage;
 import org.apache.spark.network.protocol.RpcRequest;
 import org.apache.spark.network.protocol.StreamChunkId;
 import org.apache.spark.network.protocol.StreamRequest;
-import org.apache.spark.network.util.NettyUtils;
+import static org.apache.spark.network.util.NettyUtils.getRemoteAddress;
 
 /**
  * Client for fetching consecutive chunks of a pre-negotiated stream. This API 
is intended to allow
@@ -135,9 +135,10 @@ public class TransportClient implements Closeable {
   long streamId,
   final int chunkIndex,
   final ChunkReceivedCallback callback) {
-final String serverAddr = NettyUtils.getRemoteAddress(channel);
 final long startTime = System.currentTimeMillis();
-logger.debug("Sending fetch chunk request {} to {}", chunkIndex, 
serverAddr);
+if (logger.isDebugEnabled()) {
+  logger.debug("Sending fetch chunk request {} to {}", chunkIndex, 
getRemoteAddress(channel));
+}
 
 final StreamChunkId streamChunkId = new StreamChunkId(streamId, 
chunkIndex);
 handler.addFetchRequest(streamChunkId, callback);
@@ -148,11 +149,13 @@ public class TransportClient implements Closeable {
 

spark git commit: [SPARK-17231][CORE] Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-25 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 ff2e270eb -> 73014a2aa


[SPARK-17231][CORE] Avoid building debug or trace log messages unless the 
respective log level is enabled

This is simply a backport of #14798 to `branch-2.0`. This backport omits the 
change to `ExternalShuffleBlockHandler.java`. In `branch-2.0`, that file does 
not contain the log message that was patched in `master`.

Author: Michael Allman 

Closes #14811 from mallman/spark-17231-logging_perf_improvements-2.0_backport.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/73014a2a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/73014a2a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/73014a2a

Branch: refs/heads/branch-2.0
Commit: 73014a2aa96b538d963f360fd41bac74f358ef46
Parents: ff2e270
Author: Michael Allman 
Authored: Thu Aug 25 16:29:04 2016 -0700
Committer: Shixiong Zhu 
Committed: Thu Aug 25 16:29:04 2016 -0700

--
 .../spark/network/client/TransportClient.java   | 39 
 .../network/client/TransportClientFactory.java  |  2 +-
 .../client/TransportResponseHandler.java| 15 
 .../spark/network/protocol/MessageDecoder.java  |  2 +-
 .../network/server/TransportChannelHandler.java |  6 +--
 .../network/server/TransportRequestHandler.java | 18 -
 .../spark/network/server/TransportServer.java   |  2 +-
 .../shuffle/ExternalShuffleBlockResolver.java   |  2 +-
 8 files changed, 47 insertions(+), 39 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/73014a2a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
--
diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
 
b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
index 64a8317..a67683b 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
@@ -43,7 +43,7 @@ import org.apache.spark.network.protocol.OneWayMessage;
 import org.apache.spark.network.protocol.RpcRequest;
 import org.apache.spark.network.protocol.StreamChunkId;
 import org.apache.spark.network.protocol.StreamRequest;
-import org.apache.spark.network.util.NettyUtils;
+import static org.apache.spark.network.util.NettyUtils.getRemoteAddress;
 
 /**
  * Client for fetching consecutive chunks of a pre-negotiated stream. This API 
is intended to allow
@@ -135,9 +135,10 @@ public class TransportClient implements Closeable {
   long streamId,
   final int chunkIndex,
   final ChunkReceivedCallback callback) {
-final String serverAddr = NettyUtils.getRemoteAddress(channel);
 final long startTime = System.currentTimeMillis();
-logger.debug("Sending fetch chunk request {} to {}", chunkIndex, 
serverAddr);
+if (logger.isDebugEnabled()) {
+  logger.debug("Sending fetch chunk request {} to {}", chunkIndex, 
getRemoteAddress(channel));
+}
 
 final StreamChunkId streamChunkId = new StreamChunkId(streamId, 
chunkIndex);
 handler.addFetchRequest(streamChunkId, callback);
@@ -148,11 +149,13 @@ public class TransportClient implements Closeable {
 public void operationComplete(ChannelFuture future) throws Exception {
   if (future.isSuccess()) {
 long timeTaken = System.currentTimeMillis() - startTime;
-logger.trace("Sending request {} to {} took {} ms", streamChunkId, 
serverAddr,
-  timeTaken);
+if (logger.isTraceEnabled()) {
+  logger.trace("Sending request {} to {} took {} ms", 
streamChunkId, getRemoteAddress(channel),
+timeTaken);
+}
   } else {
 String errorMsg = String.format("Failed to send request %s to %s: 
%s", streamChunkId,
-  serverAddr, future.cause());
+  getRemoteAddress(channel), future.cause());
 logger.error(errorMsg, future.cause());
 handler.removeFetchRequest(streamChunkId);
 channel.close();
@@ -173,9 +176,10 @@ public class TransportClient implements Closeable {
* @param callback Object to call with the stream data.
*/
   public void stream(final String streamId, final StreamCallback callback) {
-final String serverAddr = NettyUtils.getRemoteAddress(channel);
 final long startTime = System.currentTimeMillis();
-logger.debug("Sending stream request for {} to {}", streamId, serverAddr);
+if (logger.isDebugEnabled()) {
+  logger.debug("Sending stream request for {} to {}", streamId, 
getRemoteAddress(channel));
+}
 
 // Need to synchronize he

spark git commit: [SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely

2016-08-26 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 6f82d2da3 -> deb6a54cf


[SPARK-17165][SQL] FileStreamSource should not track the list of seen files 
indefinitely

## What changes were proposed in this pull request?
Before this change, FileStreamSource uses an in-memory hash set to track the 
list of files processed by the engine. The list can grow indefinitely, leading 
to OOM or overflow of the hash set.

This patch introduces a new user-defined option called "maxFileAge", defaulting to 
24 hours. If a file is older than this age, FileStreamSource will purge it from 
the in-memory map that was used to track the list of files that have been 
processed.
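
For reference, a small standalone sketch (not the actual purge code) of the age cut-off arithmetic described in the `FileStreamOptions` comment below, using the same numbers as the worked example in that comment:

```scala
object MaxFileAgeSketch {
  // The age threshold is anchored to the newest file seen, not the wall clock.
  def purgeThreshold(latestFileTimestamp: Long, maxFileAgeMs: Long): Long =
    latestFileTimestamp - maxFileAgeMs

  def main(args: Array[String]): Unit = {
    // Latest file at timestamp 1000, max age 200: purge entries older than 800,
    // regardless of the current system time (e.g. 2000).
    println(purgeThreshold(latestFileTimestamp = 1000L, maxFileAgeMs = 200L)) // 800

    // Users would set the age on the source itself, e.g.
    //   spark.readStream.format("text").option("maxFileAge", "12h").load(path)
    // (the option name comes from this patch; the format and path are hypothetical).
  }
}
```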

## How was this patch tested?
Added unit tests for the underlying utility, and also added an end-to-end test 
to validate the purge in FileStreamSourceSuite. Also verified the new test 
cases would fail when the timeout was set to a very large number.

Author: petermaxlee 

Closes #14728 from petermaxlee/SPARK-17165.

(cherry picked from commit 9812f7d5381f7cd8112fd30c7e45ae4f0eab6e88)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/deb6a54c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/deb6a54c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/deb6a54c

Branch: refs/heads/branch-2.0
Commit: deb6a54cf0f69d4ac5b3e1d358bb81e49eea412d
Parents: 6f82d2d
Author: petermaxlee 
Authored: Fri Aug 26 11:30:23 2016 -0700
Committer: Shixiong Zhu 
Committed: Fri Aug 26 11:30:38 2016 -0700

--
 .../execution/streaming/FileStreamOptions.scala |  54 +++
 .../execution/streaming/FileStreamSource.scala  | 149 +++
 .../execution/streaming/HDFSMetadataLog.scala   |   2 +-
 .../streaming/FileStreamSourceSuite.scala   |  76 ++
 .../sql/streaming/FileStreamSourceSuite.scala   |  40 -
 5 files changed, 285 insertions(+), 36 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/deb6a54c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala
new file mode 100644
index 000..3efc20c
--- /dev/null
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming
+
+import scala.util.Try
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.execution.datasources.CaseInsensitiveMap
+import org.apache.spark.util.Utils
+
+/**
+ * User specified options for file streams.
+ */
+class FileStreamOptions(parameters: Map[String, String]) extends Logging {
+
+  val maxFilesPerTrigger: Option[Int] = 
parameters.get("maxFilesPerTrigger").map { str =>
+Try(str.toInt).toOption.filter(_ > 0).getOrElse {
+  throw new IllegalArgumentException(
+s"Invalid value '$str' for option 'maxFilesPerTrigger', must be a 
positive integer")
+}
+  }
+
+  /**
+   * Maximum age of a file that can be found in this directory, before it is 
deleted.
+   *
+   * The max age is specified with respect to the timestamp of the latest 
file, and not the
   * timestamp of the current system. This means that if the last file has 
timestamp 1000, and the
+   * current system time is 2000, and max age is 200, the system will purge 
files older than
+   * 800 (rather than 1800) from the internal state.
+   *
+   * Defaults to a week.
+   */
+  val maxFileAgeMs: Long =
+Utils.timeStringAsMs(parameters.getOrElse("maxFileAge", "7d"))
+
+  /** Options as specified by the user, in a case-insensitive map, without 
"path" set. */
+  val optionMapWithoutPath: Map[String, String] =
+new CaseInsensitiveMap(parameters).filterKeys(_ != "path")
+}

http://git-wip-us.apache.org/repos/asf/spark/b

spark git commit: [SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely

2016-08-26 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 261c55dd8 -> 9812f7d53


[SPARK-17165][SQL] FileStreamSource should not track the list of seen files 
indefinitely

## What changes were proposed in this pull request?
Before this change, FileStreamSource uses an in-memory hash set to track the 
list of files processed by the engine. The list can grow indefinitely, leading 
to OOM or overflow of the hash set.

This patch introduces a new user-defined option called "maxFileAge", defaulting to 
24 hours. If a file is older than this age, FileStreamSource will purge it from 
the in-memory map that was used to track the list of files that have been 
processed.

## How was this patch tested?
Added unit tests for the underlying utility, and also added an end-to-end test 
to validate the purge in FileStreamSourceSuite. Also verified the new test 
cases would fail when the timeout was set to a very large number.

Author: petermaxlee 

Closes #14728 from petermaxlee/SPARK-17165.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9812f7d5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9812f7d5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9812f7d5

Branch: refs/heads/master
Commit: 9812f7d5381f7cd8112fd30c7e45ae4f0eab6e88
Parents: 261c55d
Author: petermaxlee 
Authored: Fri Aug 26 11:30:23 2016 -0700
Committer: Shixiong Zhu 
Committed: Fri Aug 26 11:30:23 2016 -0700

--
 .../execution/streaming/FileStreamOptions.scala |  54 +++
 .../execution/streaming/FileStreamSource.scala  | 149 +++
 .../execution/streaming/HDFSMetadataLog.scala   |   2 +-
 .../streaming/FileStreamSourceSuite.scala   |  76 ++
 .../sql/streaming/FileStreamSourceSuite.scala   |  40 -
 5 files changed, 285 insertions(+), 36 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9812f7d5/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala
new file mode 100644
index 000..3efc20c
--- /dev/null
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming
+
+import scala.util.Try
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.execution.datasources.CaseInsensitiveMap
+import org.apache.spark.util.Utils
+
+/**
+ * User specified options for file streams.
+ */
+class FileStreamOptions(parameters: Map[String, String]) extends Logging {
+
+  val maxFilesPerTrigger: Option[Int] = 
parameters.get("maxFilesPerTrigger").map { str =>
+Try(str.toInt).toOption.filter(_ > 0).getOrElse {
+  throw new IllegalArgumentException(
+s"Invalid value '$str' for option 'maxFilesPerTrigger', must be a 
positive integer")
+}
+  }
+
+  /**
+   * Maximum age of a file that can be found in this directory, before it is 
deleted.
+   *
+   * The max age is specified with respect to the timestamp of the latest 
file, and not the
   * timestamp of the current system. This means that if the last file has 
timestamp 1000, and the
+   * current system time is 2000, and max age is 200, the system will purge 
files older than
+   * 800 (rather than 1800) from the internal state.
+   *
+   * Defaults to a week.
+   */
+  val maxFileAgeMs: Long =
+Utils.timeStringAsMs(parameters.getOrElse("maxFileAge", "7d"))
+
+  /** Options as specified by the user, in a case-insensitive map, without 
"path" set. */
+  val optionMapWithoutPath: Map[String, String] =
+new CaseInsensitiveMap(parameters).filterKeys(_ != "path")
+}

http://git-wip-us.apache.org/repos/asf/spark/blob/9812f7d5/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
--

spark git commit: [SPARK-17314][CORE] Use Netty's DefaultThreadFactory to enable its fast ThreadLocal impl

2016-08-30 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master fb2008431 -> 02ac379e8


[SPARK-17314][CORE] Use Netty's DefaultThreadFactory to enable its fast 
ThreadLocal impl

## What changes were proposed in this pull request?

When a thread is a Netty `FastThreadLocalThread`, Netty will use its fast 
ThreadLocal implementation, which performs better than the JDK's (see the 
benchmark results in https://github.com/netty/netty/pull/4417; note: that PR is 
not a fix to Netty's FastThreadLocal, it just fixed an issue in Netty's 
benchmark code).

This PR just changed the ThreadFactory to Netty's DefaultThreadFactory which 
will use FastThreadLocalThread. There is also a minor change to the thread 
names. See 
https://github.com/netty/netty/blob/netty-4.0.22.Final/common/src/main/java/io/netty/util/concurrent/DefaultThreadFactory.java#L94
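
A minimal sketch (names invented, written in Scala against Netty's public API) showing that threads built by `DefaultThreadFactory` are `FastThreadLocalThread`s and therefore use Netty's fast `FastThreadLocal` storage:

```scala
import io.netty.util.concurrent.{DefaultThreadFactory, FastThreadLocal, FastThreadLocalThread}

object FastThreadLocalDemo {
  // Falls back to a regular ThreadLocal map unless the accessing thread is a
  // FastThreadLocalThread, in which case indexed (array-based) storage is used.
  private val counter = new FastThreadLocal[Integer] {
    override def initialValue(): Integer = Integer.valueOf(0)
  }

  def main(args: Array[String]): Unit = {
    val factory = new DefaultThreadFactory("demo-pool", /* daemon = */ true)
    val t = factory.newThread(new Runnable {
      override def run(): Unit = {
        // DefaultThreadFactory builds FastThreadLocalThread instances.
        println(Thread.currentThread().isInstanceOf[FastThreadLocalThread]) // true
        counter.set(Integer.valueOf(counter.get().intValue() + 1))
        println(counter.get()) // 1
      }
    })
    t.start()
    t.join()
  }
}
```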

## How was this patch tested?

Author: Shixiong Zhu 

Closes #14879 from zsxwing/netty-thread.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/02ac379e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/02ac379e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/02ac379e

Branch: refs/heads/master
Commit: 02ac379e8645ce5d32e033f6683136da16fbe584
Parents: fb20084
Author: Shixiong Zhu 
Authored: Tue Aug 30 13:22:21 2016 -0700
Committer: Shixiong Zhu 
Committed: Tue Aug 30 13:22:21 2016 -0700

--
 .../main/java/org/apache/spark/network/util/NettyUtils.java   | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/02ac379e/common/network-common/src/main/java/org/apache/spark/network/util/NettyUtils.java
--
diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/util/NettyUtils.java
 
b/common/network-common/src/main/java/org/apache/spark/network/util/NettyUtils.java
index 10de9d3..5e85180 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/util/NettyUtils.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/util/NettyUtils.java
@@ -20,7 +20,6 @@ package org.apache.spark.network.util;
 import java.lang.reflect.Field;
 import java.util.concurrent.ThreadFactory;
 
-import com.google.common.util.concurrent.ThreadFactoryBuilder;
 import io.netty.buffer.PooledByteBufAllocator;
 import io.netty.channel.Channel;
 import io.netty.channel.EventLoopGroup;
@@ -31,6 +30,7 @@ import io.netty.channel.epoll.EpollSocketChannel;
 import io.netty.channel.nio.NioEventLoopGroup;
 import io.netty.channel.socket.nio.NioServerSocketChannel;
 import io.netty.channel.socket.nio.NioSocketChannel;
+import io.netty.util.concurrent.DefaultThreadFactory;
 import io.netty.util.internal.PlatformDependent;
 
 /**
@@ -39,10 +39,7 @@ import io.netty.util.internal.PlatformDependent;
 public class NettyUtils {
   /** Creates a new ThreadFactory which prefixes each thread with the given 
name. */
   public static ThreadFactory createThreadFactory(String threadPoolPrefix) {
-return new ThreadFactoryBuilder()
-  .setDaemon(true)
-  .setNameFormat(threadPoolPrefix + "-%d")
-  .build();
+return new DefaultThreadFactory(threadPoolPrefix, true);
   }
 
   /** Creates a Netty EventLoopGroup based on the IOMode. */





spark git commit: [SPARK-17318][TESTS] Fix ReplSuite replicating blocks of object with class defined in repl

2016-08-30 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master f7beae6da -> 231f97329


[SPARK-17318][TESTS] Fix ReplSuite replicating blocks of object with class 
defined in repl

## What changes were proposed in this pull request?

There have been a lot of failures recently: 
http://spark-tests.appspot.com/tests/org.apache.spark.repl.ReplSuite/replicating%20blocks%20of%20object%20with%20class%20defined%20in%20repl

This PR just changed the persist level to `MEMORY_AND_DISK_2` to avoid blocks 
being evicted from memory.
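
For context, a hedged standalone sketch of the storage level involved (a local SparkSession is created here just for illustration; this is not the ReplSuite test itself):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistLevelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("persist-sketch").getOrCreate()
    val sc = spark.sparkContext

    // With MEMORY_ONLY_2, a block that no longer fits in memory is simply
    // dropped (and recomputed later), which makes counting replicated blocks
    // flaky. MEMORY_AND_DISK_2 spills such blocks to disk instead of losing them.
    val ret = sc.parallelize(1 to 100, 10).persist(StorageLevel.MEMORY_AND_DISK_2)
    println(ret.count()) // 100

    spark.stop()
  }
}
```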

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu 

Closes #14884 from zsxwing/SPARK-17318.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/231f9732
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/231f9732
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/231f9732

Branch: refs/heads/master
Commit: 231f973295129dca976f2e4a8222a63318d4aafe
Parents: f7beae6
Author: Shixiong Zhu 
Authored: Tue Aug 30 20:04:52 2016 -0700
Committer: Shixiong Zhu 
Committed: Tue Aug 30 20:04:52 2016 -0700

--
 .../src/test/scala/org/apache/spark/repl/ReplSuite.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/231f9732/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
--
diff --git 
a/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala 
b/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
index 06b09f3..f1284b1 100644
--- a/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
+++ b/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
@@ -401,7 +401,7 @@ class ReplSuite extends SparkFunSuite {
   """
 |import org.apache.spark.storage.StorageLevel._
 |case class Foo(i: Int)
-|val ret = sc.parallelize((1 to 100).map(Foo), 
10).persist(MEMORY_ONLY_2)
+|val ret = sc.parallelize((1 to 100).map(Foo), 
10).persist(MEMORY_AND_DISK_2)
 |ret.count()
 |sc.getExecutorStorageStatus.map(s => s.rddBlocksById(ret.id).size).sum
   """.stripMargin)





spark git commit: [SPARK-17318][TESTS] Fix ReplSuite replicating blocks of object with class defined in repl

2016-08-30 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 f35b10ab1 -> bc6c0d9f9


[SPARK-17318][TESTS] Fix ReplSuite replicating blocks of object with class 
defined in repl

## What changes were proposed in this pull request?

There have been a lot of failures recently: 
http://spark-tests.appspot.com/tests/org.apache.spark.repl.ReplSuite/replicating%20blocks%20of%20object%20with%20class%20defined%20in%20repl

This PR just changed the persist level to `MEMORY_AND_DISK_2` to avoid blocks 
being evicted from memory.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu 

Closes #14884 from zsxwing/SPARK-17318.

(cherry picked from commit 231f973295129dca976f2e4a8222a63318d4aafe)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bc6c0d9f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bc6c0d9f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bc6c0d9f

Branch: refs/heads/branch-2.0
Commit: bc6c0d9f96da6a9aaf8279ee4ad11a82bcd69cb5
Parents: f35b10a
Author: Shixiong Zhu 
Authored: Tue Aug 30 20:04:52 2016 -0700
Committer: Shixiong Zhu 
Committed: Tue Aug 30 20:05:06 2016 -0700

--
 .../src/test/scala/org/apache/spark/repl/ReplSuite.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bc6c0d9f/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
--
diff --git 
a/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala 
b/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
index 06b09f3..f1284b1 100644
--- a/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
+++ b/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
@@ -401,7 +401,7 @@ class ReplSuite extends SparkFunSuite {
   """
 |import org.apache.spark.storage.StorageLevel._
 |case class Foo(i: Int)
-|val ret = sc.parallelize((1 to 100).map(Foo), 
10).persist(MEMORY_ONLY_2)
+|val ret = sc.parallelize((1 to 100).map(Foo), 
10).persist(MEMORY_AND_DISK_2)
 |ret.count()
 |sc.getExecutorStorageStatus.map(s => s.rddBlocksById(ret.id).size).sum
   """.stripMargin)





spark git commit: [SPARK-22975][SS] MetricsReporter should not throw exception when there was no progress reported

2018-01-12 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 7bd14cfd4 -> 54277398a


[SPARK-22975][SS] MetricsReporter should not throw exception when there was no 
progress reported

## What changes were proposed in this pull request?

`MetricsReporter` assumes that there has been some progress for the query, i.e. 
`lastProgress` is not null. If this is not true, as can happen under particular 
conditions, a `NullPointerException` can be thrown.

The PR checks whether there is a `lastProgress`, and if there is not, it returns 
a default value for the metrics.
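
A standalone sketch of the null-guarded gauge pattern (Codahale metrics only; `FakeQuery` and the metric name are made up, and this is not Spark's MetricsReporter):

```scala
import com.codahale.metrics.{Gauge, MetricRegistry}

// Hypothetical stand-in for a streaming query that has not reported progress yet.
final class FakeQuery {
  @volatile var lastProgress: java.lang.Double = null
}

object NullSafeGaugeDemo {
  def main(args: Array[String]): Unit = {
    val registry = new MetricRegistry
    val query = new FakeQuery

    // Gauges are evaluated lazily whenever the registry is polled, so a null
    // lastProgress would surface as an NPE at reporting time unless the read
    // is guarded and mapped onto a default.
    registry.register("inputRate-total", new Gauge[Double] {
      override def getValue: Double =
        Option(query.lastProgress).map(_.doubleValue()).getOrElse(0.0)
    })

    println(registry.getGauges.get("inputRate-total").getValue) // 0.0 before any progress
    query.lastProgress = java.lang.Double.valueOf(42.0)
    println(registry.getGauges.get("inputRate-total").getValue) // 42.0 afterwards
  }
}
```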

## How was this patch tested?

added UT

Author: Marco Gaido 

Closes #20189 from mgaido91/SPARK-22975.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/54277398
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/54277398
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/54277398

Branch: refs/heads/master
Commit: 54277398afbde92a38ba2802f4a7a3e5910533de
Parents: 7bd14cf
Author: Marco Gaido 
Authored: Fri Jan 12 11:25:37 2018 -0800
Committer: Shixiong Zhu 
Committed: Fri Jan 12 11:25:37 2018 -0800

--
 .../execution/streaming/MetricsReporter.scala   | 21 +-
 .../sql/streaming/StreamingQuerySuite.scala | 23 
 2 files changed, 33 insertions(+), 11 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/54277398/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
index b84e6ce..66b11ec 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
@@ -17,15 +17,11 @@
 
 package org.apache.spark.sql.execution.streaming
 
-import java.{util => ju}
-
-import scala.collection.mutable
-
 import com.codahale.metrics.{Gauge, MetricRegistry}
 
 import org.apache.spark.internal.Logging
 import org.apache.spark.metrics.source.{Source => CodahaleSource}
-import org.apache.spark.util.Clock
+import org.apache.spark.sql.streaming.StreamingQueryProgress
 
 /**
  * Serves metrics from a [[org.apache.spark.sql.streaming.StreamingQuery]] to
@@ -39,14 +35,17 @@ class MetricsReporter(
 
   // Metric names should not have . in them, so that all the metrics of a 
query are identified
   // together in Ganglia as a single metric group
-  registerGauge("inputRate-total", () => 
stream.lastProgress.inputRowsPerSecond)
-  registerGauge("processingRate-total", () => 
stream.lastProgress.processedRowsPerSecond)
-  registerGauge("latency", () => 
stream.lastProgress.durationMs.get("triggerExecution").longValue())
-
-  private def registerGauge[T](name: String, f: () => T)(implicit num: 
Numeric[T]): Unit = {
+  registerGauge("inputRate-total", _.inputRowsPerSecond, 0.0)
+  registerGauge("processingRate-total", _.processedRowsPerSecond, 0.0)
+  registerGauge("latency", _.durationMs.get("triggerExecution").longValue(), 
0L)
+
+  private def registerGauge[T](
+  name: String,
+  f: StreamingQueryProgress => T,
+  default: T): Unit = {
 synchronized {
   metricRegistry.register(name, new Gauge[T] {
-override def getValue: T = f()
+override def getValue: T = 
Option(stream.lastProgress).map(f).getOrElse(default)
   })
 }
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/54277398/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
index 2fa4595..76201c6 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
@@ -424,6 +424,29 @@ class StreamingQuerySuite extends StreamTest with 
BeforeAndAfter with Logging wi
 }
   }
 
+  test("SPARK-22975: MetricsReporter defaults when there was no progress 
reported") {
+withSQLConf("spark.sql.streaming.metricsEnabled" -> "true") {
+  BlockingSource.latch = new CountDownLatch(1)
+  withTempDir { tempDir =>
+val sq = spark.readStream
+  .format("org.apache.spark.sql.streaming.util.BlockingSource")
+  .load()
+  .writeStream
+  .format("org.apache.spark.sql.streaming.util.BlockingSource")
+  .option("checkpointLocation", tempDir.toString)
+  .sta

spark git commit: [SPARK-22975][SS] MetricsReporter should not throw exception when there was no progress reported

2018-01-12 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 20eea20c7 -> 105ae8680


[SPARK-22975][SS] MetricsReporter should not throw exception when there was no 
progress reported

## What changes were proposed in this pull request?

`MetricsReporter` assumes that there has been some progress for the query, i.e. 
`lastProgress` is not null. If this is not true, as can happen under particular 
conditions, a `NullPointerException` can be thrown.

The PR checks whether there is a `lastProgress`, and if there is not, it returns 
a default value for the metrics.

## How was this patch tested?

added UT

Author: Marco Gaido 

Closes #20189 from mgaido91/SPARK-22975.

(cherry picked from commit 54277398afbde92a38ba2802f4a7a3e5910533de)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/105ae868
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/105ae868
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/105ae868

Branch: refs/heads/branch-2.2
Commit: 105ae86801e2d1017ffad422085481f1b9038a1f
Parents: 20eea20
Author: Marco Gaido 
Authored: Fri Jan 12 11:25:37 2018 -0800
Committer: Shixiong Zhu 
Committed: Fri Jan 12 11:25:59 2018 -0800

--
 .../execution/streaming/MetricsReporter.scala   | 21 +-
 .../sql/streaming/StreamingQuerySuite.scala | 23 
 2 files changed, 33 insertions(+), 11 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/105ae868/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
index b84e6ce..66b11ec 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
@@ -17,15 +17,11 @@
 
 package org.apache.spark.sql.execution.streaming
 
-import java.{util => ju}
-
-import scala.collection.mutable
-
 import com.codahale.metrics.{Gauge, MetricRegistry}
 
 import org.apache.spark.internal.Logging
 import org.apache.spark.metrics.source.{Source => CodahaleSource}
-import org.apache.spark.util.Clock
+import org.apache.spark.sql.streaming.StreamingQueryProgress
 
 /**
  * Serves metrics from a [[org.apache.spark.sql.streaming.StreamingQuery]] to
@@ -39,14 +35,17 @@ class MetricsReporter(
 
   // Metric names should not have . in them, so that all the metrics of a 
query are identified
   // together in Ganglia as a single metric group
-  registerGauge("inputRate-total", () => 
stream.lastProgress.inputRowsPerSecond)
-  registerGauge("processingRate-total", () => 
stream.lastProgress.processedRowsPerSecond)
-  registerGauge("latency", () => 
stream.lastProgress.durationMs.get("triggerExecution").longValue())
-
-  private def registerGauge[T](name: String, f: () => T)(implicit num: 
Numeric[T]): Unit = {
+  registerGauge("inputRate-total", _.inputRowsPerSecond, 0.0)
+  registerGauge("processingRate-total", _.processedRowsPerSecond, 0.0)
+  registerGauge("latency", _.durationMs.get("triggerExecution").longValue(), 
0L)
+
+  private def registerGauge[T](
+  name: String,
+  f: StreamingQueryProgress => T,
+  default: T): Unit = {
 synchronized {
   metricRegistry.register(name, new Gauge[T] {
-override def getValue: T = f()
+override def getValue: T = 
Option(stream.lastProgress).map(f).getOrElse(default)
   })
 }
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/105ae868/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
index ee5af65..01c34b1 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
@@ -425,6 +425,29 @@ class StreamingQuerySuite extends StreamTest with 
BeforeAndAfter with Logging wi
 }
   }
 
+  test("SPARK-22975: MetricsReporter defaults when there was no progress 
reported") {
+withSQLConf("spark.sql.streaming.metricsEnabled" -> "true") {
+  BlockingSource.latch = new CountDownLatch(1)
+  withTempDir { tempDir =>
+val sq = spark.readStream
+  .format("org.apache.spark.sql.streaming.util.BlockingSource")
+  .load()
+  .writeStream
+  .format("org.apache.spark.sql

spark git commit: [SPARK-22975][SS] MetricsReporter should not throw exception when there was no progress reported

2018-01-12 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 db27a9365 -> 02176f4c2


[SPARK-22975][SS] MetricsReporter should not throw exception when there was no 
progress reported

## What changes were proposed in this pull request?

`MetricsReporter` assumes that there has been some progress for the query, i.e. 
`lastProgress` is not null. If this is not true, as can happen under particular 
conditions, a `NullPointerException` can be thrown.

The PR checks whether there is a `lastProgress`, and if there is not, it returns 
a default value for the metrics.

## How was this patch tested?

added UT

Author: Marco Gaido 

Closes #20189 from mgaido91/SPARK-22975.

(cherry picked from commit 54277398afbde92a38ba2802f4a7a3e5910533de)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/02176f4c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/02176f4c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/02176f4c

Branch: refs/heads/branch-2.3
Commit: 02176f4c2f60342068669b215485ffd443346aed
Parents: db27a93
Author: Marco Gaido 
Authored: Fri Jan 12 11:25:37 2018 -0800
Committer: Shixiong Zhu 
Committed: Fri Jan 12 11:25:45 2018 -0800

--
 .../execution/streaming/MetricsReporter.scala   | 21 +-
 .../sql/streaming/StreamingQuerySuite.scala | 23 
 2 files changed, 33 insertions(+), 11 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/02176f4c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
index b84e6ce..66b11ec 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetricsReporter.scala
@@ -17,15 +17,11 @@
 
 package org.apache.spark.sql.execution.streaming
 
-import java.{util => ju}
-
-import scala.collection.mutable
-
 import com.codahale.metrics.{Gauge, MetricRegistry}
 
 import org.apache.spark.internal.Logging
 import org.apache.spark.metrics.source.{Source => CodahaleSource}
-import org.apache.spark.util.Clock
+import org.apache.spark.sql.streaming.StreamingQueryProgress
 
 /**
  * Serves metrics from a [[org.apache.spark.sql.streaming.StreamingQuery]] to
@@ -39,14 +35,17 @@ class MetricsReporter(
 
   // Metric names should not have . in them, so that all the metrics of a 
query are identified
   // together in Ganglia as a single metric group
-  registerGauge("inputRate-total", () => 
stream.lastProgress.inputRowsPerSecond)
-  registerGauge("processingRate-total", () => 
stream.lastProgress.processedRowsPerSecond)
-  registerGauge("latency", () => 
stream.lastProgress.durationMs.get("triggerExecution").longValue())
-
-  private def registerGauge[T](name: String, f: () => T)(implicit num: 
Numeric[T]): Unit = {
+  registerGauge("inputRate-total", _.inputRowsPerSecond, 0.0)
+  registerGauge("processingRate-total", _.processedRowsPerSecond, 0.0)
+  registerGauge("latency", _.durationMs.get("triggerExecution").longValue(), 
0L)
+
+  private def registerGauge[T](
+  name: String,
+  f: StreamingQueryProgress => T,
+  default: T): Unit = {
 synchronized {
   metricRegistry.register(name, new Gauge[T] {
-override def getValue: T = f()
+override def getValue: T = 
Option(stream.lastProgress).map(f).getOrElse(default)
   })
 }
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/02176f4c/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
index 2fa4595..76201c6 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
@@ -424,6 +424,29 @@ class StreamingQuerySuite extends StreamTest with 
BeforeAndAfter with Logging wi
 }
   }
 
+  test("SPARK-22975: MetricsReporter defaults when there was no progress 
reported") {
+withSQLConf("spark.sql.streaming.metricsEnabled" -> "true") {
+  BlockingSource.latch = new CountDownLatch(1)
+  withTempDir { tempDir =>
+val sq = spark.readStream
+  .format("org.apache.spark.sql.streaming.util.BlockingSource")
+  .load()
+  .writeStream
+  .format("org.apache.spark.sql

spark git commit: [SPARK-22956][SS] Bug fix for 2 streams union failover scenario

2018-01-15 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master c7572b79d -> 07ae39d0e


[SPARK-22956][SS] Bug fix for 2 streams union failover scenario

## What changes were proposed in this pull request?

This problem was reported by yanlin-Lynn, ivoson and LiangchangZ. Thanks!

When we union two streams from Kafka or other sources, and one of them has no
continuous data coming in while a task restarts at the same time, an
`IllegalStateException` is thrown. This is mainly caused by the code in
[MicroBatchExecution](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L190):
when one stream has no continuous data, its committedOffset equals its
availableOffset during `populateStartOffsets`, yet `currentPartitionOffsets` is
not handled properly in KafkaSource. We should probably also consider this
scenario for other Sources.
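
For context, a minimal sketch of the kind of query that can hit this, assuming a
hypothetical broker address, topic names and checkpoint path: one of the two
unioned topics receives no new data, and the restarted query recovers from its
checkpoint.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("union-failover").getOrCreate()

// Two Kafka streams; "idle-topic" receives no new data while the query is
// restarted from its checkpoint, which is the failover case described above.
def kafkaStream(topic: String) = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", topic)
  .load()

val query = kafkaStream("busy-topic").union(kafkaStream("idle-topic"))
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/union-ckpt")
  .start()
```

With the fix, the restarted query returns an empty batch for the idle source
instead of failing with `IllegalStateException`.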

## How was this patch tested?

Add a UT in KafkaSourceSuite.scala

Author: Yuanjian Li 

Closes #20150 from xuanyuanking/SPARK-22956.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07ae39d0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07ae39d0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07ae39d0

Branch: refs/heads/master
Commit: 07ae39d0ec1f03b1c73259373a8bb599694c7860
Parents: c7572b7
Author: Yuanjian Li 
Authored: Mon Jan 15 22:01:14 2018 -0800
Committer: Shixiong Zhu 
Committed: Mon Jan 15 22:01:14 2018 -0800

--
 .../apache/spark/sql/kafka010/KafkaSource.scala | 13 ++--
 .../spark/sql/kafka010/KafkaSourceSuite.scala   | 65 
 .../streaming/MicroBatchExecution.scala |  6 +-
 .../spark/sql/execution/streaming/memory.scala  |  6 ++
 4 files changed, 81 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/07ae39d0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
--
diff --git 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
index e9cff04..864a92b 100644
--- 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
+++ 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
@@ -223,6 +223,14 @@ private[kafka010] class KafkaSource(
 
 logInfo(s"GetBatch called with start = $start, end = $end")
 val untilPartitionOffsets = KafkaSourceOffset.getPartitionOffsets(end)
+// On recovery, getBatch will get called before getOffset
+if (currentPartitionOffsets.isEmpty) {
+  currentPartitionOffsets = Some(untilPartitionOffsets)
+}
+if (start.isDefined && start.get == end) {
+  return sqlContext.internalCreateDataFrame(
+sqlContext.sparkContext.emptyRDD, schema, isStreaming = true)
+}
 val fromPartitionOffsets = start match {
   case Some(prevBatchEndOffset) =>
 KafkaSourceOffset.getPartitionOffsets(prevBatchEndOffset)
@@ -305,11 +313,6 @@ private[kafka010] class KafkaSource(
 logInfo("GetBatch generating RDD of offset range: " +
   offsetRanges.sortBy(_.topicPartition.toString).mkString(", "))
 
-// On recovery, getBatch will get called before getOffset
-if (currentPartitionOffsets.isEmpty) {
-  currentPartitionOffsets = Some(untilPartitionOffsets)
-}
-
 sqlContext.internalCreateDataFrame(rdd, schema, isStreaming = true)
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/07ae39d0/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
--
diff --git 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
index 2034b9b..a0f5695 100644
--- 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
+++ 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
@@ -318,6 +318,71 @@ class KafkaSourceSuite extends KafkaSourceTest {
 )
   }
 
+  test("SPARK-22956: currentPartitionOffsets should be set when no new data 
comes in") {
+def getSpecificDF(range: Range.Inclusive): 
org.apache.spark.sql.Dataset[Int] = {
+  val topic = newTopic()
+  testUtils.createTopic(topic, partitions = 1)
+  testUtils.sendMessages(topic, range.map(_.toString).toArray, Some(0))
+
+  val reader = spark
+.readStream
+.format("kafka")
+.option("kafka.bootstrap.servers", testUtils.brokerAddress)
+.option("kafka.metadata.

spark git commit: [SPARK-22956][SS] Bug fix for 2 streams union failover scenario

2018-01-15 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 e2ffb9781 -> e58c4a929


[SPARK-22956][SS] Bug fix for 2 streams union failover scenario

## What changes were proposed in this pull request?

This problem was reported by yanlin-Lynn, ivoson and LiangchangZ. Thanks!

When we union two streams from Kafka or other sources, and one of them has no
continuous data coming in while a task restarts at the same time, an
`IllegalStateException` is thrown. This is mainly caused by the code in
[MicroBatchExecution](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L190):
when one stream has no continuous data, its committedOffset equals its
availableOffset during `populateStartOffsets`, yet `currentPartitionOffsets` is
not handled properly in KafkaSource. We should probably also consider this
scenario for other Sources.

## How was this patch tested?

Add a UT in KafkaSourceSuite.scala

Author: Yuanjian Li 

Closes #20150 from xuanyuanking/SPARK-22956.

(cherry picked from commit 07ae39d0ec1f03b1c73259373a8bb599694c7860)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e58c4a92
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e58c4a92
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e58c4a92

Branch: refs/heads/branch-2.3
Commit: e58c4a929a5cbd2d611b3e07a29fcc93a827d980
Parents: e2ffb97
Author: Yuanjian Li 
Authored: Mon Jan 15 22:01:14 2018 -0800
Committer: Shixiong Zhu 
Committed: Mon Jan 15 22:01:23 2018 -0800

--
 .../apache/spark/sql/kafka010/KafkaSource.scala | 13 ++--
 .../spark/sql/kafka010/KafkaSourceSuite.scala   | 65 
 .../streaming/MicroBatchExecution.scala |  6 +-
 .../spark/sql/execution/streaming/memory.scala  |  6 ++
 4 files changed, 81 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e58c4a92/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
--
diff --git 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
index e9cff04..864a92b 100644
--- 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
+++ 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
@@ -223,6 +223,14 @@ private[kafka010] class KafkaSource(
 
 logInfo(s"GetBatch called with start = $start, end = $end")
 val untilPartitionOffsets = KafkaSourceOffset.getPartitionOffsets(end)
+// On recovery, getBatch will get called before getOffset
+if (currentPartitionOffsets.isEmpty) {
+  currentPartitionOffsets = Some(untilPartitionOffsets)
+}
+if (start.isDefined && start.get == end) {
+  return sqlContext.internalCreateDataFrame(
+sqlContext.sparkContext.emptyRDD, schema, isStreaming = true)
+}
 val fromPartitionOffsets = start match {
   case Some(prevBatchEndOffset) =>
 KafkaSourceOffset.getPartitionOffsets(prevBatchEndOffset)
@@ -305,11 +313,6 @@ private[kafka010] class KafkaSource(
 logInfo("GetBatch generating RDD of offset range: " +
   offsetRanges.sortBy(_.topicPartition.toString).mkString(", "))
 
-// On recovery, getBatch will get called before getOffset
-if (currentPartitionOffsets.isEmpty) {
-  currentPartitionOffsets = Some(untilPartitionOffsets)
-}
-
 sqlContext.internalCreateDataFrame(rdd, schema, isStreaming = true)
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/e58c4a92/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
--
diff --git 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
index 2034b9b..a0f5695 100644
--- 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
+++ 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
@@ -318,6 +318,71 @@ class KafkaSourceSuite extends KafkaSourceTest {
 )
   }
 
+  test("SPARK-22956: currentPartitionOffsets should be set when no new data 
comes in") {
+def getSpecificDF(range: Range.Inclusive): 
org.apache.spark.sql.Dataset[Int] = {
+  val topic = newTopic()
+  testUtils.createTopic(topic, partitions = 1)
+  testUtils.sendMessages(topic, range.map(_.toString).toArray, Some(0))
+
+  val reader = spark
+.readStream
+.format("kaf

spark git commit: Fix merge between 07ae39d0ec and 1667057851

2018-01-16 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 50345a2aa -> a963980a6


Fix merge between 07ae39d0ec and 1667057851

## What changes were proposed in this pull request?

The first commit added a new test, and the second refactored the class the test 
was in. The automatic merge put the test in the wrong place.

## How was this patch tested?
-

Author: Jose Torres 

Closes #20289 from jose-torres/fix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a963980a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a963980a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a963980a

Branch: refs/heads/master
Commit: a963980a6d2b4bef2c546aa33acf0aa501d2507b
Parents: 50345a2
Author: Jose Torres 
Authored: Tue Jan 16 22:27:28 2018 -0800
Committer: Shixiong Zhu 
Committed: Tue Jan 16 22:27:28 2018 -0800

--
 .../org/apache/spark/sql/kafka010/KafkaSourceSuite.scala  | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a963980a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
--
diff --git 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
index 1acff61..62f6a34 100644
--- 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
+++ 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
@@ -479,11 +479,6 @@ class KafkaMicroBatchSourceSuite extends 
KafkaSourceSuiteBase {
 // `failOnDataLoss` is `false`, we should not fail the query
 assert(query.exception.isEmpty)
   }
-}
-
-class KafkaSourceSuiteBase extends KafkaSourceTest {
-
-  import testImplicits._
 
   test("SPARK-22956: currentPartitionOffsets should be set when no new data 
comes in") {
 def getSpecificDF(range: Range.Inclusive): 
org.apache.spark.sql.Dataset[Int] = {
@@ -549,6 +544,11 @@ class KafkaSourceSuiteBase extends KafkaSourceTest {
   CheckLastBatch(120 to 124: _*)
 )
   }
+}
+
+class KafkaSourceSuiteBase extends KafkaSourceTest {
+
+  import testImplicits._
 
   test("cannot stop Kafka stream") {
 val topic = newTopic()





spark git commit: Fix merge between 07ae39d0ec and 1667057851

2018-01-16 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 b9339eee1 -> 00c744e40


Fix merge between 07ae39d0ec and 1667057851

## What changes were proposed in this pull request?

The first commit added a new test, and the second refactored the class the test 
was in. The automatic merge put the test in the wrong place.

## How was this patch tested?
-

Author: Jose Torres 

Closes #20289 from jose-torres/fix.

(cherry picked from commit a963980a6d2b4bef2c546aa33acf0aa501d2507b)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/00c744e4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/00c744e4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/00c744e4

Branch: refs/heads/branch-2.3
Commit: 00c744e40be3a96f1fe7c377725703fc7b9ca3e3
Parents: b9339ee
Author: Jose Torres 
Authored: Tue Jan 16 22:27:28 2018 -0800
Committer: Shixiong Zhu 
Committed: Tue Jan 16 22:27:35 2018 -0800

--
 .../org/apache/spark/sql/kafka010/KafkaSourceSuite.scala  | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/00c744e4/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
--
diff --git 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
index 1acff61..62f6a34 100644
--- 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
+++ 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
@@ -479,11 +479,6 @@ class KafkaMicroBatchSourceSuite extends 
KafkaSourceSuiteBase {
 // `failOnDataLoss` is `false`, we should not fail the query
 assert(query.exception.isEmpty)
   }
-}
-
-class KafkaSourceSuiteBase extends KafkaSourceTest {
-
-  import testImplicits._
 
   test("SPARK-22956: currentPartitionOffsets should be set when no new data 
comes in") {
 def getSpecificDF(range: Range.Inclusive): 
org.apache.spark.sql.Dataset[Int] = {
@@ -549,6 +544,11 @@ class KafkaSourceSuiteBase extends KafkaSourceTest {
   CheckLastBatch(120 to 124: _*)
 )
   }
+}
+
+class KafkaSourceSuiteBase extends KafkaSourceTest {
+
+  import testImplicits._
 
   test("cannot stop Kafka stream") {
 val topic = newTopic()





spark git commit: [SPARK-23093][SS] Don't change run id when reconfiguring a continuous processing query.

2018-01-17 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 86a845031 -> e946c63dd


[SPARK-23093][SS] Don't change run id when reconfiguring a continuous 
processing query.

## What changes were proposed in this pull request?

Keep the run ID static, using a different ID for the epoch coordinator to avoid 
cross-execution message contamination.
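
A minimal sketch of the idea, with illustrative names rather than the actual
Spark internals:

```scala
import java.util.UUID

// The query keeps one runId for its whole lifetime, while every (re)start of
// the execution mints a fresh epoch-coordinator id, so RPC messages from a
// previous execution cannot be delivered to the coordinator of the current one.
class ContinuousQueryIds {
  val runId: UUID = UUID.randomUUID()                    // stable across reconfigurations
  @volatile private var epochCoordinatorId: String = ""  // changes on every execution

  def startNewExecution(): String = {
    epochCoordinatorId = UUID.randomUUID().toString
    epochCoordinatorId
  }

  def currentEpochCoordinatorId: String = epochCoordinatorId
}
```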

## How was this patch tested?

new and existing unit tests

Author: Jose Torres 

Closes #20282 from jose-torres/fix-runid.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e946c63d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e946c63d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e946c63d

Branch: refs/heads/master
Commit: e946c63dd56d121cf898084ed7e9b5b0868b226e
Parents: 86a8450
Author: Jose Torres 
Authored: Wed Jan 17 13:58:44 2018 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 17 13:58:44 2018 -0800

--
 .../datasources/v2/DataSourceV2ScanExec.scala   |  3 ++-
 .../datasources/v2/WriteToDataSourceV2.scala|  5 ++--
 .../execution/streaming/StreamExecution.scala   |  3 +--
 .../ContinuousDataSourceRDDIter.scala   | 10 
 .../continuous/ContinuousExecution.scala| 18 +-
 .../streaming/continuous/EpochCoordinator.scala |  9 +++
 .../apache/spark/sql/streaming/StreamTest.scala |  2 +-
 .../streaming/StreamingQueryListenerSuite.scala | 25 
 8 files changed, 54 insertions(+), 21 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e946c63d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
index 8c64df0..beb6673 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
@@ -58,7 +58,8 @@ case class DataSourceV2ScanExec(
 
 case _: ContinuousReader =>
   EpochCoordinatorRef.get(
-sparkContext.getLocalProperty(ContinuousExecution.RUN_ID_KEY), 
sparkContext.env)
+  
sparkContext.getLocalProperty(ContinuousExecution.EPOCH_COORDINATOR_ID_KEY),
+  sparkContext.env)
 .askSync[Unit](SetReaderPartitions(readTasks.size()))
   new ContinuousDataSourceRDD(sparkContext, sqlContext, readTasks)
 .asInstanceOf[RDD[InternalRow]]

http://git-wip-us.apache.org/repos/asf/spark/blob/e946c63d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala
index a4a857f..3dbdae7 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala
@@ -64,7 +64,8 @@ case class WriteToDataSourceV2Exec(writer: 
DataSourceV2Writer, query: SparkPlan)
   val runTask = writer match {
 case w: ContinuousWriter =>
   EpochCoordinatorRef.get(
-sparkContext.getLocalProperty(ContinuousExecution.RUN_ID_KEY), 
sparkContext.env)
+
sparkContext.getLocalProperty(ContinuousExecution.EPOCH_COORDINATOR_ID_KEY),
+sparkContext.env)
 .askSync[Unit](SetWriterPartitions(rdd.getNumPartitions))
 
   (context: TaskContext, iter: Iterator[InternalRow]) =>
@@ -135,7 +136,7 @@ object DataWritingSparkTask extends Logging {
   iter: Iterator[InternalRow]): WriterCommitMessage = {
 val dataWriter = writeTask.createDataWriter(context.partitionId(), 
context.attemptNumber())
 val epochCoordinator = EpochCoordinatorRef.get(
-  context.getLocalProperty(ContinuousExecution.RUN_ID_KEY),
+  context.getLocalProperty(ContinuousExecution.EPOCH_COORDINATOR_ID_KEY),
   SparkEnv.get)
 val currentMsg: WriterCommitMessage = null
 var currentEpoch = 
context.getLocalProperty(ContinuousExecution.START_EPOCH_KEY).toLong

http://git-wip-us.apache.org/repos/asf/spark/blob/e946c63d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 
b/sql/core/src/main/scala/

spark git commit: [SPARK-23093][SS] Don't change run id when reconfiguring a continuous processing query.

2018-01-17 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 dbd2a5566 -> 79ccd0cad


[SPARK-23093][SS] Don't change run id when reconfiguring a continuous 
processing query.

## What changes were proposed in this pull request?

Keep the run ID static, using a different ID for the epoch coordinator to avoid 
cross-execution message contamination.

## How was this patch tested?

new and existing unit tests

Author: Jose Torres 

Closes #20282 from jose-torres/fix-runid.

(cherry picked from commit e946c63dd56d121cf898084ed7e9b5b0868b226e)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/79ccd0ca
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/79ccd0ca
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/79ccd0ca

Branch: refs/heads/branch-2.3
Commit: 79ccd0cadf09c41c0f4b5853a54798be17a20584
Parents: dbd2a55
Author: Jose Torres 
Authored: Wed Jan 17 13:58:44 2018 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 17 13:58:53 2018 -0800

--
 .../datasources/v2/DataSourceV2ScanExec.scala   |  3 ++-
 .../datasources/v2/WriteToDataSourceV2.scala|  5 ++--
 .../execution/streaming/StreamExecution.scala   |  3 +--
 .../ContinuousDataSourceRDDIter.scala   | 10 
 .../continuous/ContinuousExecution.scala| 18 +-
 .../streaming/continuous/EpochCoordinator.scala |  9 +++
 .../apache/spark/sql/streaming/StreamTest.scala |  2 +-
 .../streaming/StreamingQueryListenerSuite.scala | 25 
 8 files changed, 54 insertions(+), 21 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/79ccd0ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
index 8c64df0..beb6673 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
@@ -58,7 +58,8 @@ case class DataSourceV2ScanExec(
 
 case _: ContinuousReader =>
   EpochCoordinatorRef.get(
-sparkContext.getLocalProperty(ContinuousExecution.RUN_ID_KEY), 
sparkContext.env)
+  
sparkContext.getLocalProperty(ContinuousExecution.EPOCH_COORDINATOR_ID_KEY),
+  sparkContext.env)
 .askSync[Unit](SetReaderPartitions(readTasks.size()))
   new ContinuousDataSourceRDD(sparkContext, sqlContext, readTasks)
 .asInstanceOf[RDD[InternalRow]]

http://git-wip-us.apache.org/repos/asf/spark/blob/79ccd0ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala
index a4a857f..3dbdae7 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala
@@ -64,7 +64,8 @@ case class WriteToDataSourceV2Exec(writer: 
DataSourceV2Writer, query: SparkPlan)
   val runTask = writer match {
 case w: ContinuousWriter =>
   EpochCoordinatorRef.get(
-sparkContext.getLocalProperty(ContinuousExecution.RUN_ID_KEY), 
sparkContext.env)
+
sparkContext.getLocalProperty(ContinuousExecution.EPOCH_COORDINATOR_ID_KEY),
+sparkContext.env)
 .askSync[Unit](SetWriterPartitions(rdd.getNumPartitions))
 
   (context: TaskContext, iter: Iterator[InternalRow]) =>
@@ -135,7 +136,7 @@ object DataWritingSparkTask extends Logging {
   iter: Iterator[InternalRow]): WriterCommitMessage = {
 val dataWriter = writeTask.createDataWriter(context.partitionId(), 
context.attemptNumber())
 val epochCoordinator = EpochCoordinatorRef.get(
-  context.getLocalProperty(ContinuousExecution.RUN_ID_KEY),
+  context.getLocalProperty(ContinuousExecution.EPOCH_COORDINATOR_ID_KEY),
   SparkEnv.get)
 val currentMsg: WriterCommitMessage = null
 var currentEpoch = 
context.getLocalProperty(ContinuousExecution.START_EPOCH_KEY).toLong

http://git-wip-us.apache.org/repos/asf/spark/blob/79ccd0ca/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
--
diff --git 
a/sql/core

spark git commit: [SPARK-23119][SS] Minor fixes to V2 streaming APIs

2018-01-17 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 7823d43ec -> bac0d661a


[SPARK-23119][SS] Minor fixes to V2 streaming APIs

## What changes were proposed in this pull request?

- Added `InterfaceStability.Evolving` annotations
- Improved docs.

## How was this patch tested?
Existing tests.

Author: Tathagata Das 

Closes #20286 from tdas/SPARK-23119.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bac0d661
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bac0d661
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bac0d661

Branch: refs/heads/master
Commit: bac0d661af6092dd26638223156827aceb901229
Parents: 7823d43
Author: Tathagata Das 
Authored: Wed Jan 17 16:40:02 2018 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 17 16:40:02 2018 -0800

--
 .../v2/streaming/ContinuousReadSupport.java   |  2 ++
 .../v2/streaming/reader/ContinuousDataReader.java |  2 ++
 .../v2/streaming/reader/ContinuousReader.java |  9 +++--
 .../v2/streaming/reader/MicroBatchReader.java |  5 +
 .../sql/sources/v2/streaming/reader/Offset.java   | 18 +-
 .../v2/streaming/reader/PartitionOffset.java  |  3 +++
 .../sql/sources/v2/writer/DataSourceV2Writer.java |  5 -
 7 files changed, 36 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bac0d661/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/ContinuousReadSupport.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/ContinuousReadSupport.java
 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/ContinuousReadSupport.java
index 3136cee..9a93a80 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/ContinuousReadSupport.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/ContinuousReadSupport.java
@@ -19,6 +19,7 @@ package org.apache.spark.sql.sources.v2.streaming;
 
 import java.util.Optional;
 
+import org.apache.spark.annotation.InterfaceStability;
 import org.apache.spark.sql.sources.v2.DataSourceV2;
 import org.apache.spark.sql.sources.v2.DataSourceV2Options;
 import org.apache.spark.sql.sources.v2.streaming.reader.ContinuousReader;
@@ -28,6 +29,7 @@ import org.apache.spark.sql.types.StructType;
  * A mix-in interface for {@link DataSourceV2}. Data sources can implement 
this interface to
  * provide data reading ability for continuous stream processing.
  */
+@InterfaceStability.Evolving
 public interface ContinuousReadSupport extends DataSourceV2 {
   /**
* Creates a {@link ContinuousReader} to scan the data from this data source.

http://git-wip-us.apache.org/repos/asf/spark/blob/bac0d661/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousDataReader.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousDataReader.java
 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousDataReader.java
index ca9a290..3f13a4d 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousDataReader.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousDataReader.java
@@ -17,11 +17,13 @@
 
 package org.apache.spark.sql.sources.v2.streaming.reader;
 
+import org.apache.spark.annotation.InterfaceStability;
 import org.apache.spark.sql.sources.v2.reader.DataReader;
 
 /**
  * A variation on {@link DataReader} for use with streaming in continuous 
processing mode.
  */
+@InterfaceStability.Evolving
 public interface ContinuousDataReader extends DataReader {
 /**
  * Get the offset of the current record, or the start offset if no records 
have been read.

http://git-wip-us.apache.org/repos/asf/spark/blob/bac0d661/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousReader.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousReader.java
 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousReader.java
index f0b2058..745f1ce 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousReader.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousReader.java
@@ -17,6 +17,7 @@
 
 package org.apache.spark.sql.sources.v2.streaming.reader;
 
+import org.apache.spark.annotation.InterfaceStability;
 import org.apache.spark.sql.execution.streaming.BaseStreamingSource;
 import org.apache.spark.sql.sources.v2.reader.Data

spark git commit: [SPARK-23119][SS] Minor fixes to V2 streaming APIs

2018-01-17 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 b84c2a306 -> 9783aea2c


[SPARK-23119][SS] Minor fixes to V2 streaming APIs

## What changes were proposed in this pull request?

- Added `InterfaceStability.Evolving` annotations
- Improved docs.

## How was this patch tested?
Existing tests.

Author: Tathagata Das 

Closes #20286 from tdas/SPARK-23119.

(cherry picked from commit bac0d661af6092dd26638223156827aceb901229)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9783aea2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9783aea2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9783aea2

Branch: refs/heads/branch-2.3
Commit: 9783aea2c75700e7ce9551ccfd33e43765de8981
Parents: b84c2a3
Author: Tathagata Das 
Authored: Wed Jan 17 16:40:02 2018 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 17 16:40:11 2018 -0800

--
 .../v2/streaming/ContinuousReadSupport.java   |  2 ++
 .../v2/streaming/reader/ContinuousDataReader.java |  2 ++
 .../v2/streaming/reader/ContinuousReader.java |  9 +++--
 .../v2/streaming/reader/MicroBatchReader.java |  5 +
 .../sql/sources/v2/streaming/reader/Offset.java   | 18 +-
 .../v2/streaming/reader/PartitionOffset.java  |  3 +++
 .../sql/sources/v2/writer/DataSourceV2Writer.java |  5 -
 7 files changed, 36 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9783aea2/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/ContinuousReadSupport.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/ContinuousReadSupport.java
 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/ContinuousReadSupport.java
index 3136cee..9a93a80 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/ContinuousReadSupport.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/ContinuousReadSupport.java
@@ -19,6 +19,7 @@ package org.apache.spark.sql.sources.v2.streaming;
 
 import java.util.Optional;
 
+import org.apache.spark.annotation.InterfaceStability;
 import org.apache.spark.sql.sources.v2.DataSourceV2;
 import org.apache.spark.sql.sources.v2.DataSourceV2Options;
 import org.apache.spark.sql.sources.v2.streaming.reader.ContinuousReader;
@@ -28,6 +29,7 @@ import org.apache.spark.sql.types.StructType;
  * A mix-in interface for {@link DataSourceV2}. Data sources can implement 
this interface to
  * provide data reading ability for continuous stream processing.
  */
+@InterfaceStability.Evolving
 public interface ContinuousReadSupport extends DataSourceV2 {
   /**
* Creates a {@link ContinuousReader} to scan the data from this data source.

http://git-wip-us.apache.org/repos/asf/spark/blob/9783aea2/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousDataReader.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousDataReader.java
 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousDataReader.java
index ca9a290..3f13a4d 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousDataReader.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousDataReader.java
@@ -17,11 +17,13 @@
 
 package org.apache.spark.sql.sources.v2.streaming.reader;
 
+import org.apache.spark.annotation.InterfaceStability;
 import org.apache.spark.sql.sources.v2.reader.DataReader;
 
 /**
  * A variation on {@link DataReader} for use with streaming in continuous 
processing mode.
  */
+@InterfaceStability.Evolving
 public interface ContinuousDataReader extends DataReader {
 /**
  * Get the offset of the current record, or the start offset if no records 
have been read.

http://git-wip-us.apache.org/repos/asf/spark/blob/9783aea2/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousReader.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousReader.java
 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousReader.java
index f0b2058..745f1ce 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousReader.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousReader.java
@@ -17,6 +17,7 @@
 
 package org.apache.spark.sql.sources.v2.streaming.reader;
 
+import org.apache.spark.annotation.InterfaceStability;
 import org.ap

spark git commit: [SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

2018-01-17 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master bac0d661a -> 1002bd6b2


[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das 

Closes #20255 from tdas/join-docs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1002bd6b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1002bd6b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1002bd6b

Branch: refs/heads/master
Commit: 1002bd6b23ff78a010ca259ea76988ef4c478c6e
Parents: bac0d66
Author: Tathagata Das 
Authored: Wed Jan 17 16:41:43 2018 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 17 16:41:43 2018 -0800

--
 docs/structured-streaming-programming-guide.md | 338 +++-
 1 file changed, 326 insertions(+), 12 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1002bd6b/docs/structured-streaming-programming-guide.md
--
diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index de13e28..1779a42 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -1051,7 +1051,19 @@ output mode.
 
 
 ### Join Operations
-Streaming DataFrames can be joined with static DataFrames to create new 
streaming DataFrames. Here are a few examples.
+Structured Streaming supports joining a streaming Dataset/DataFrame with a 
static Dataset/DataFrame
+as well as another streaming Dataset/DataFrame. The result of the streaming 
join is generated
+incrementally, similar to the results of streaming aggregations in the 
previous section. In this
+section we will explore what type of joins (i.e. inner, outer, etc.) are 
supported in the above
+cases. Note that in all the supported join types, the result of the join with 
a streaming
+Dataset/DataFrame will be exactly the same as if it was with a static 
Dataset/DataFrame
+containing the same data in the stream.
+
+
+ Stream-static joins
+
+Since the introduction in Spark 2.0, Structured Streaming has supported joins 
(inner join and some
+type of outer joins) between a streaming and a static DataFrame/Dataset. Here 
is a simple example.
 
 
 
@@ -1089,6 +1101,300 @@ streamingDf.join(staticDf, "type", "right_join")  # 
right outer join with a stat
 
 
 
+Note that stream-static joins are not stateful, so no state management is 
necessary.
+However, a few types of stream-static outer joins are not yet supported.
+These are listed at the [end of this Join 
section](#support-matrix-for-joins-in-streaming-queries).
+
+ Stream-stream Joins
+In Spark 2.3, we have added support for stream-stream joins, that is, you can 
join two streaming
+Datasets/DataFrames. The challenge of generating join results between two data 
streams is that,
+at any point of time, the view of the dataset is incomplete for both sides of 
the join making
+it much harder to find matches between inputs. Any row received from one input 
stream can match
+with any future, yet-to-be-received row from the other input stream. Hence, 
for both the input
+streams, we buffer past input as streaming state, so that we can match every 
future input with
+past input and accordingly generate joined results. Furthermore, similar to 
streaming aggregations,
+we automatically handle late, out-of-order data and can limit the state using 
watermarks.
+Let’s discuss the different types of supported stream-stream joins and how 
to use them.
+
+# Inner Joins with optional Watermarking
+Inner joins on any kind of columns along with any kind of join conditions are 
supported.
+However, as the stream runs, the size of streaming state will keep growing 
indefinitely as
+*all* past input must be saved as the any new input can match with any input 
from the past.
+To avoid unbounded state, you have to define additional join conditions such 
that indefinitely
+old inputs cannot match with future inputs and therefore can be cleared from 
the state.
+In other words, you will have to do the following additional steps in the join.
+
+1. Define watermark delays on both inputs such that the engine knows how 
delayed the input can be
+(similar to streaming a
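
A minimal Scala sketch of the watermarked inner stream-stream join the new guide
section describes, with assumed column names, watermark delays and a rate source
standing in for real inputs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("stream-stream-join").getOrCreate()

// Two streaming inputs; the rate source and column names are placeholders.
val impressions = spark.readStream.format("rate").load()
  .selectExpr("value AS adId", "timestamp AS impressionTime")
val clicks = spark.readStream.format("rate").load()
  .selectExpr("value AS clickAdId", "timestamp AS clickTime")

// 1. Watermark both inputs so old state can eventually be dropped.
// 2. Constrain event time across the two streams in the join condition.
val joined = impressions
  .withWatermark("impressionTime", "2 hours")
  .join(
    clicks.withWatermark("clickTime", "3 hours"),
    expr("""
      clickAdId = adId AND
      clickTime >= impressionTime AND
      clickTime <= impressionTime + interval 1 hour
    """))

joined.writeStream.format("console").start()
```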

spark git commit: [SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

2018-01-17 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 9783aea2c -> 050c1e24e


[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das 

Closes #20255 from tdas/join-docs.

(cherry picked from commit 1002bd6b23ff78a010ca259ea76988ef4c478c6e)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/050c1e24
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/050c1e24
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/050c1e24

Branch: refs/heads/branch-2.3
Commit: 050c1e24e506ff224bcf4e3e458e57fbd216765c
Parents: 9783aea
Author: Tathagata Das 
Authored: Wed Jan 17 16:41:43 2018 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 17 16:41:49 2018 -0800

--
 docs/structured-streaming-programming-guide.md | 338 +++-
 1 file changed, 326 insertions(+), 12 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/050c1e24/docs/structured-streaming-programming-guide.md
--
diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index de13e28..1779a42 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -1051,7 +1051,19 @@ output mode.
 
 
 ### Join Operations
-Streaming DataFrames can be joined with static DataFrames to create new 
streaming DataFrames. Here are a few examples.
+Structured Streaming supports joining a streaming Dataset/DataFrame with a 
static Dataset/DataFrame
+as well as another streaming Dataset/DataFrame. The result of the streaming 
join is generated
+incrementally, similar to the results of streaming aggregations in the 
previous section. In this
+section we will explore what type of joins (i.e. inner, outer, etc.) are 
supported in the above
+cases. Note that in all the supported join types, the result of the join with 
a streaming
+Dataset/DataFrame will be exactly the same as if it was with a static 
Dataset/DataFrame
+containing the same data in the stream.
+
+
+ Stream-static joins
+
+Since the introduction in Spark 2.0, Structured Streaming has supported joins 
(inner join and some
+type of outer joins) between a streaming and a static DataFrame/Dataset. Here 
is a simple example.
 
 
 
@@ -1089,6 +1101,300 @@ streamingDf.join(staticDf, "type", "right_join")  # 
right outer join with a stat
 
 
 
+Note that stream-static joins are not stateful, so no state management is 
necessary.
+However, a few types of stream-static outer joins are not yet supported.
+These are listed at the [end of this Join 
section](#support-matrix-for-joins-in-streaming-queries).
+
+ Stream-stream Joins
+In Spark 2.3, we have added support for stream-stream joins, that is, you can 
join two streaming
+Datasets/DataFrames. The challenge of generating join results between two data 
streams is that,
+at any point of time, the view of the dataset is incomplete for both sides of 
the join making
+it much harder to find matches between inputs. Any row received from one input 
stream can match
+with any future, yet-to-be-received row from the other input stream. Hence, 
for both the input
+streams, we buffer past input as streaming state, so that we can match every 
future input with
+past input and accordingly generate joined results. Furthermore, similar to 
streaming aggregations,
+we automatically handle late, out-of-order data and can limit the state using 
watermarks.
+Let’s discuss the different types of supported stream-stream joins and how 
to use them.
+
+# Inner Joins with optional Watermarking
+Inner joins on any kind of columns along with any kind of join conditions are 
supported.
+However, as the stream runs, the size of streaming state will keep growing 
indefinitely as
+*all* past input must be saved as the any new input can match with any input 
from the past.
+To avoid unbounded state, you have to define additional join conditions such 
that indefinitely
+old inputs cannot match with future inputs and therefore can be cleared from 
the state.
+In other words, you will have to do the following additional steps in the join.
+
+1. Define waterm

spark git commit: [SPARK-21996][SQL] read files with space in name for streaming

2018-01-17 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 1002bd6b2 -> 021947020


[SPARK-21996][SQL] read files with space in name for streaming

## What changes were proposed in this pull request?

Structured Streaming is now able to read files with a space in the file name
(previously it would skip such a file and output a warning)
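
A minimal sketch of the decoding step the fix applies, using an assumed example
path:

```scala
import java.net.URI
import org.apache.hadoop.fs.Path

// Entries seen by the file stream source carry URI-encoded paths (a space is
// stored as "%20"); decoding them before handing them to the batch DataSource
// lets the file be read instead of skipped.
val stored  = "file:/data/input/file%20with%20space.txt"
val decoded = new Path(new URI(stored)).toString
// decoded: "file:/data/input/file with space.txt"
```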

## How was this patch tested?

Added new unit test.

Author: Xiayun Sun 

Closes #19247 from xysun/SPARK-21996.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/02194702
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/02194702
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/02194702

Branch: refs/heads/master
Commit: 02194702068291b3af77486d01029fb848c36d7b
Parents: 1002bd6
Author: Xiayun Sun 
Authored: Wed Jan 17 16:42:38 2018 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 17 16:42:38 2018 -0800

--
 .../execution/streaming/FileStreamSource.scala  |  2 +-
 .../sql/streaming/FileStreamSourceSuite.scala   | 50 +++-
 2 files changed, 49 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/02194702/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
index 0debd7d..8c016ab 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
@@ -166,7 +166,7 @@ class FileStreamSource(
 val newDataSource =
   DataSource(
 sparkSession,
-paths = files.map(_.path),
+paths = files.map(f => new Path(new URI(f.path)).toString),
 userSpecifiedSchema = Some(schema),
 partitionColumns = partitionColumns,
 className = fileFormatClassName,

http://git-wip-us.apache.org/repos/asf/spark/blob/02194702/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
index 39bb572..5bb0f4d 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
@@ -74,11 +74,11 @@ abstract class FileStreamSourceTest
 protected def addData(source: FileStreamSource): Unit
   }
 
-  case class AddTextFileData(content: String, src: File, tmp: File)
+  case class AddTextFileData(content: String, src: File, tmp: File, 
tmpFilePrefix: String = "text")
 extends AddFileData {
 
 override def addData(source: FileStreamSource): Unit = {
-  val tempFile = Utils.tempFileWith(new File(tmp, "text"))
+  val tempFile = Utils.tempFileWith(new File(tmp, tmpFilePrefix))
   val finalFile = new File(src, tempFile.getName)
   src.mkdirs()
   require(stringToFile(tempFile, content).renameTo(finalFile))
@@ -408,6 +408,52 @@ class FileStreamSourceSuite extends FileStreamSourceTest {
 }
   }
 
+  test("SPARK-21996 read from text files -- file name has space") {
+withTempDirs { case (src, tmp) =>
+  val textStream = createFileStream("text", src.getCanonicalPath)
+  val filtered = textStream.filter($"value" contains "keep")
+
+  testStream(filtered)(
+AddTextFileData("drop1\nkeep2\nkeep3", src, tmp, "text text"),
+CheckAnswer("keep2", "keep3")
+  )
+}
+  }
+
+  test("SPARK-21996 read from text files generated by file sink -- file name 
has space") {
+val testTableName = "FileStreamSourceTest"
+withTable(testTableName) {
+  withTempDirs { case (src, checkpoint) =>
+val output = new File(src, "text text")
+val inputData = MemoryStream[String]
+val ds = inputData.toDS()
+
+val query = ds.writeStream
+  .option("checkpointLocation", checkpoint.getCanonicalPath)
+  .format("text")
+  .start(output.getCanonicalPath)
+
+try {
+  inputData.addData("foo")
+  failAfter(streamingTimeout) {
+query.processAllAvailable()
+  }
+} finally {
+  query.stop()
+}
+
+val df2 = spark.readStream.format("text").load(output.getCanonicalPath)
+val query2 = 
df2.writeStream.format("memory").queryName(testTableName).start()
+try {
+  query2.processAllAvailable()
+  checkDatasetUnorderly(sp

spark git commit: [SPARK-21996][SQL] read files with space in name for streaming

2018-01-17 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 050c1e24e -> f2688ef0f


[SPARK-21996][SQL] read files with space in name for streaming

## What changes were proposed in this pull request?

Structured Streaming is now able to read files with a space in the file name
(previously it would skip such a file and output a warning)

## How was this patch tested?

Added new unit test.

Author: Xiayun Sun 

Closes #19247 from xysun/SPARK-21996.

(cherry picked from commit 02194702068291b3af77486d01029fb848c36d7b)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f2688ef0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f2688ef0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f2688ef0

Branch: refs/heads/branch-2.3
Commit: f2688ef0fbd9d355d13ce4056d35e99970f4cd47
Parents: 050c1e2
Author: Xiayun Sun 
Authored: Wed Jan 17 16:42:38 2018 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 17 16:42:45 2018 -0800

--
 .../execution/streaming/FileStreamSource.scala  |  2 +-
 .../sql/streaming/FileStreamSourceSuite.scala   | 50 +++-
 2 files changed, 49 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f2688ef0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
index 0debd7d..8c016ab 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
@@ -166,7 +166,7 @@ class FileStreamSource(
 val newDataSource =
   DataSource(
 sparkSession,
-paths = files.map(_.path),
+paths = files.map(f => new Path(new URI(f.path)).toString),
 userSpecifiedSchema = Some(schema),
 partitionColumns = partitionColumns,
 className = fileFormatClassName,

http://git-wip-us.apache.org/repos/asf/spark/blob/f2688ef0/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
index 39bb572..5bb0f4d 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
@@ -74,11 +74,11 @@ abstract class FileStreamSourceTest
 protected def addData(source: FileStreamSource): Unit
   }
 
-  case class AddTextFileData(content: String, src: File, tmp: File)
+  case class AddTextFileData(content: String, src: File, tmp: File, 
tmpFilePrefix: String = "text")
 extends AddFileData {
 
 override def addData(source: FileStreamSource): Unit = {
-  val tempFile = Utils.tempFileWith(new File(tmp, "text"))
+  val tempFile = Utils.tempFileWith(new File(tmp, tmpFilePrefix))
   val finalFile = new File(src, tempFile.getName)
   src.mkdirs()
   require(stringToFile(tempFile, content).renameTo(finalFile))
@@ -408,6 +408,52 @@ class FileStreamSourceSuite extends FileStreamSourceTest {
 }
   }
 
+  test("SPARK-21996 read from text files -- file name has space") {
+withTempDirs { case (src, tmp) =>
+  val textStream = createFileStream("text", src.getCanonicalPath)
+  val filtered = textStream.filter($"value" contains "keep")
+
+  testStream(filtered)(
+AddTextFileData("drop1\nkeep2\nkeep3", src, tmp, "text text"),
+CheckAnswer("keep2", "keep3")
+  )
+}
+  }
+
+  test("SPARK-21996 read from text files generated by file sink -- file name 
has space") {
+val testTableName = "FileStreamSourceTest"
+withTable(testTableName) {
+  withTempDirs { case (src, checkpoint) =>
+val output = new File(src, "text text")
+val inputData = MemoryStream[String]
+val ds = inputData.toDS()
+
+val query = ds.writeStream
+  .option("checkpointLocation", checkpoint.getCanonicalPath)
+  .format("text")
+  .start(output.getCanonicalPath)
+
+try {
+  inputData.addData("foo")
+  failAfter(streamingTimeout) {
+query.processAllAvailable()
+  }
+} finally {
+  query.stop()
+}
+
+val df2 = spark.readStream.format("text").load(output.getCanonicalPath)
+val query2 = 
df2.writeStream.format("memory").queryName(testTa

spark git commit: [SPARK-23198][SS][TEST] Fix KafkaContinuousSourceStressForDontFailOnDataLossSuite to test ContinuousExecution

2018-01-24 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 30272c668 -> 500c94434


[SPARK-23198][SS][TEST] Fix 
KafkaContinuousSourceStressForDontFailOnDataLossSuite to test 
ContinuousExecution

## What changes were proposed in this pull request?

Currently, `KafkaContinuousSourceStressForDontFailOnDataLossSuite` runs on 
`MicroBatchExecution`. It should test `ContinuousExecution`.
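
A minimal sketch of how the trigger chooses the execution mode, with an assumed
rate source, console sink and interval:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("continuous-trigger").getOrCreate()

// Without an explicit trigger the query runs under micro-batch execution;
// requesting Trigger.Continuous is what selects ContinuousExecution.
spark.readStream.format("rate").load()
  .writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```

This is the same one-line `.trigger(...)` change the patch applies to the stress
suite.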

## How was this patch tested?

Pass the updated test suite.

Author: Dongjoon Hyun 

Closes #20374 from dongjoon-hyun/SPARK-23198.

(cherry picked from commit bc9641d9026aeae3571915b003ac971f6245d53c)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/500c9443
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/500c9443
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/500c9443

Branch: refs/heads/branch-2.3
Commit: 500c94434d8f5267b1488accd176cf54b69e6ba4
Parents: 30272c6
Author: Dongjoon Hyun 
Authored: Wed Jan 24 12:58:44 2018 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 24 12:58:51 2018 -0800

--
 .../org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala  | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/500c9443/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala
--
diff --git 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala
 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala
index b3dade4..a7083fa 100644
--- 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala
+++ 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala
@@ -91,6 +91,7 @@ class KafkaContinuousSourceStressForDontFailOnDataLossSuite
 ds.writeStream
   .format("memory")
   .queryName("memory")
+  .trigger(Trigger.Continuous("1 second"))
   .start()
   }
 }





spark git commit: [SPARK-23198][SS][TEST] Fix KafkaContinuousSourceStressForDontFailOnDataLossSuite to test ContinuousExecution

2018-01-24 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 0e178e152 -> bc9641d90


[SPARK-23198][SS][TEST] Fix 
KafkaContinuousSourceStressForDontFailOnDataLossSuite to test 
ContinuousExecution

## What changes were proposed in this pull request?

Currently, `KafkaContinuousSourceStressForDontFailOnDataLossSuite` runs on 
`MicroBatchExecution`. It should test `ContinuousExecution`.

## How was this patch tested?

Pass the updated test suite.

Author: Dongjoon Hyun 

Closes #20374 from dongjoon-hyun/SPARK-23198.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bc9641d9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bc9641d9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bc9641d9

Branch: refs/heads/master
Commit: bc9641d9026aeae3571915b003ac971f6245d53c
Parents: 0e178e1
Author: Dongjoon Hyun 
Authored: Wed Jan 24 12:58:44 2018 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 24 12:58:44 2018 -0800

--
 .../org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala  | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bc9641d9/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala
--
diff --git 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala
 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala
index b3dade4..a7083fa 100644
--- 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala
+++ 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala
@@ -91,6 +91,7 @@ class KafkaContinuousSourceStressForDontFailOnDataLossSuite
 ds.writeStream
   .format("memory")
   .queryName("memory")
+  .trigger(Trigger.Continuous("1 second"))
   .start()
   }
 }





spark git commit: [SPARK-23242][SS][TESTS] Don't run tests in KafkaSourceSuiteBase twice

2018-01-26 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 94c67a76e -> 073744985


[SPARK-23242][SS][TESTS] Don't run tests in KafkaSourceSuiteBase twice

## What changes were proposed in this pull request?

KafkaSourceSuiteBase should be an abstract class; otherwise KafkaSourceSuiteBase itself will also be run and its shared tests will execute twice.
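
As a hedged illustration of why the shared base must be abstract (the suite names below are invented, and plain ScalaTest is used instead of Spark's own test harness):

```scala
import org.scalatest.FunSuite

// A concrete base class is discovered and executed by ScalaTest on its own,
// so every shared test would run once for the base and once per subclass.
// Declaring the base abstract means only the concrete variants are run.
abstract class SharedKafkaSuiteBase extends FunSuite {
  test("shared source behaviour") {
    assert(Seq(1, 2, 3).sum == 6)
  }
}

class MicroBatchVariantSuite extends SharedKafkaSuiteBase
class ContinuousVariantSuite extends SharedKafkaSuiteBase
```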

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #20412 from zsxwing/SPARK-23242.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07374498
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07374498
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07374498

Branch: refs/heads/master
Commit: 073744985f439ca90afb9bd0bbc1332c53f7b4bb
Parents: 94c67a7
Author: Shixiong Zhu 
Authored: Fri Jan 26 16:09:57 2018 -0800
Committer: Shixiong Zhu 
Committed: Fri Jan 26 16:09:57 2018 -0800

--
 .../scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/07374498/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
--
diff --git 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
index 27dbb3f..c4cb1bc 100644
--- 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
+++ 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
@@ -546,7 +546,7 @@ class KafkaMicroBatchSourceSuite extends 
KafkaSourceSuiteBase {
   }
 }
 
-class KafkaSourceSuiteBase extends KafkaSourceTest {
+abstract class KafkaSourceSuiteBase extends KafkaSourceTest {
 
   import testImplicits._
 





spark git commit: [SPARK-23242][SS][TESTS] Don't run tests in KafkaSourceSuiteBase twice

2018-01-26 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 30d16e116 -> 7aaf23cf8


[SPARK-23242][SS][TESTS] Don't run tests in KafkaSourceSuiteBase twice

## What changes were proposed in this pull request?

KafkaSourceSuiteBase should be an abstract class; otherwise KafkaSourceSuiteBase itself will also be run and its shared tests will execute twice.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #20412 from zsxwing/SPARK-23242.

(cherry picked from commit 073744985f439ca90afb9bd0bbc1332c53f7b4bb)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7aaf23cf
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7aaf23cf
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7aaf23cf

Branch: refs/heads/branch-2.3
Commit: 7aaf23cf8ab871a8e8877ec82183656ae5f4be7b
Parents: 30d16e1
Author: Shixiong Zhu 
Authored: Fri Jan 26 16:09:57 2018 -0800
Committer: Shixiong Zhu 
Committed: Fri Jan 26 16:10:04 2018 -0800

--
 .../scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7aaf23cf/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
--
diff --git 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
index 27dbb3f..c4cb1bc 100644
--- 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
+++ 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
@@ -546,7 +546,7 @@ class KafkaMicroBatchSourceSuite extends 
KafkaSourceSuiteBase {
   }
 }
 
-class KafkaSourceSuiteBase extends KafkaSourceTest {
+abstract class KafkaSourceSuiteBase extends KafkaSourceTest {
 
   import testImplicits._
 





spark git commit: [SPARK-23245][SS][TESTS] Don't access `lastExecution.executedPlan` in StreamTest

2018-01-26 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 234c854bd -> 65600bfdb


[SPARK-23245][SS][TESTS] Don't access `lastExecution.executedPlan` in StreamTest

## What changes were proposed in this pull request?

`lastExecution.executedPlan` is a lazy val, so accessing it in StreamTest may need to acquire the lock of `lastExecution`. The test thread may wait forever when the streaming thread is holding that lock while running a continuous Spark job.

This PR changes the check to whether `s.lastExecution` is null, which avoids touching `lastExecution.executedPlan`.
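
A hedged, Spark-free sketch of the hazard (the `Execution` class and its field are invented): initializing a Scala lazy val synchronizes on the owning object, so a test thread touching the lazy val can block behind whichever thread is already inside the initializer, while a plain null check takes no lock at all.

```scala
object LazyValLockSketch {
  class Execution {
    lazy val executedPlan: String = {
      Thread.sleep(60000) // stand-in for the streaming thread's long-running job
      "plan"
    }
  }

  @volatile private var lastExecution: Execution = null

  def main(args: Array[String]): Unit = {
    val streamingThread = new Thread(new Runnable {
      override def run(): Unit = {
        val e = new Execution
        lastExecution = e
        e.executedPlan // enters the lazy initializer and keeps e's monitor
      }
    })
    streamingThread.setDaemon(true)
    streamingThread.start()
    Thread.sleep(100)

    // Safe: a null check never touches the execution object's monitor.
    // (StreamTest wraps this in eventually { ... }, so a race simply retries.)
    assert(lastExecution != null)

    // Unsafe -- the old assertion: this would block until the initializer
    // above finishes, because lazy val initialization holds the object's lock.
    // lastExecution.executedPlan
  }
}
```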

## How was this patch tested?

Jenkins

Author: Jose Torres 

Closes #20413 from zsxwing/SPARK-23245.

(cherry picked from commit 6328868e524121bd00595959d6d059f74e038a6b)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/65600bfd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/65600bfd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/65600bfd

Branch: refs/heads/branch-2.3
Commit: 65600bfdb9417e5f2bd2e40312e139f592f238e9
Parents: 234c854
Author: Jose Torres 
Authored: Fri Jan 26 23:06:03 2018 -0800
Committer: Shixiong Zhu 
Committed: Fri Jan 26 23:06:11 2018 -0800

--
 .../src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/65600bfd/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
index efdb0e0..d643356 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
@@ -472,7 +472,7 @@ trait StreamTest extends QueryTest with SharedSQLContext 
with TimeLimits with Be
   currentStream.awaitInitialization(streamingTimeout.toMillis)
   currentStream match {
 case s: ContinuousExecution => 
eventually("IncrementalExecution was not created") {
-  s.lastExecution.executedPlan // will fail if lastExecution 
is null
+  assert(s.lastExecution != null)
 }
 case _ =>
   }





spark git commit: [SPARK-23245][SS][TESTS] Don't access `lastExecution.executedPlan` in StreamTest

2018-01-26 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master e7bc9f052 -> 6328868e5


[SPARK-23245][SS][TESTS] Don't access `lastExecution.executedPlan` in StreamTest

## What changes were proposed in this pull request?

`lastExecution.executedPlan` is a lazy val, so accessing it in StreamTest may need to acquire the lock of `lastExecution`. The test thread may wait forever when the streaming thread is holding that lock while running a continuous Spark job.

This PR changes the check to whether `s.lastExecution` is null, which avoids touching `lastExecution.executedPlan`.

## How was this patch tested?

Jenkins

Author: Jose Torres 

Closes #20413 from zsxwing/SPARK-23245.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6328868e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6328868e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6328868e

Branch: refs/heads/master
Commit: 6328868e524121bd00595959d6d059f74e038a6b
Parents: e7bc9f0
Author: Jose Torres 
Authored: Fri Jan 26 23:06:03 2018 -0800
Committer: Shixiong Zhu 
Committed: Fri Jan 26 23:06:03 2018 -0800

--
 .../src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6328868e/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
index efdb0e0..d643356 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
@@ -472,7 +472,7 @@ trait StreamTest extends QueryTest with SharedSQLContext 
with TimeLimits with Be
   currentStream.awaitInitialization(streamingTimeout.toMillis)
   currentStream match {
 case s: ContinuousExecution => 
eventually("IncrementalExecution was not created") {
-  s.lastExecution.executedPlan // will fail if lastExecution 
is null
+  assert(s.lastExecution != null)
 }
 case _ =>
   }





spark git commit: [SPARK-23400][SQL] Add a constructors for ScalaUDF

2018-02-13 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master d58fe2883 -> 2ee76c22b


[SPARK-23400][SQL] Add a constructors for ScalaUDF

## What changes were proposed in this pull request?

In the upcoming 2.3 release, we changed the interface of `ScalaUDF`. Unfortunately, some Spark packages (e.g., spark-deep-learning) use our internal class `ScalaUDF`. In release 2.3 we added new parameters to this class, so users hit binary compatibility issues and got this exception:

```
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V
```

This PR improves backward compatibility. However, we definitely should not encourage external packages to use our internal classes; relying on them makes the Spark code harder to maintain and develop.
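
A hedged, generic sketch of the compatibility trick (the `Widget` class is invented and is not the real `ScalaUDF`): default parameter values are expanded at the call site, so adding parameters with defaults still changes the constructor's JVM signature, and an explicit auxiliary constructor with the old arity keeps previously compiled callers linking.

```scala
// Newer version of a class that grew two extra constructor parameters.
case class Widget(
    name: String,
    size: Int,
    nullable: Boolean = true,        // added in a later release
    udfDeterministic: Boolean = true // added in a later release
) {
  // Auxiliary constructor matching the old two-argument JVM signature, so
  // bytecode compiled against the old release does not hit NoSuchMethodError.
  def this(name: String, size: Int) =
    this(name, size, nullable = true, udfDeterministic = true)
}

object WidgetCompatDemo {
  def main(args: Array[String]): Unit = {
    // An old caller compiled against Widget(name, size) resolves to the
    // auxiliary constructor above.
    val w = new Widget("w", 1)
    println(w)
  }
}
```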

## How was this patch tested?
N/A

Author: gatorsmile 

Closes #20591 from gatorsmile/scalaUDF.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2ee76c22
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2ee76c22
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2ee76c22

Branch: refs/heads/master
Commit: 2ee76c22b6e48e643694c9475e5f0d37124215e7
Parents: d58fe28
Author: gatorsmile 
Authored: Tue Feb 13 11:56:49 2018 -0800
Committer: Shixiong Zhu 
Committed: Tue Feb 13 11:56:49 2018 -0800

--
 .../apache/spark/sql/catalyst/expressions/ScalaUDF.scala | 11 +++
 1 file changed, 11 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2ee76c22/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
index 388ef42..989c023 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
@@ -49,6 +49,17 @@ case class ScalaUDF(
 udfDeterministic: Boolean = true)
   extends Expression with ImplicitCastInputTypes with NonSQLExpression with 
UserDefinedExpression {
 
+  // The constructor for SPARK 2.1 and 2.2
+  def this(
+  function: AnyRef,
+  dataType: DataType,
+  children: Seq[Expression],
+  inputTypes: Seq[DataType],
+  udfName: Option[String]) = {
+this(
+  function, dataType, children, inputTypes, udfName, nullable = true, 
udfDeterministic = true)
+  }
+
   override lazy val deterministic: Boolean = udfDeterministic && 
children.forall(_.deterministic)
 
   override def toString: String =





spark git commit: [SPARK-23400][SQL] Add a constructors for ScalaUDF

2018-02-13 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 320ffb130 -> 4f6a457d4


[SPARK-23400][SQL] Add a constructors for ScalaUDF

## What changes were proposed in this pull request?

In the upcoming 2.3 release, we changed the interface of `ScalaUDF`. Unfortunately, some Spark packages (e.g., spark-deep-learning) use our internal class `ScalaUDF`. In release 2.3 we added new parameters to this class, so users hit binary compatibility issues and got this exception:

```
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V
```

This PR improves backward compatibility. However, we definitely should not encourage external packages to use our internal classes; relying on them makes the Spark code harder to maintain and develop.

## How was this patch tested?
N/A

Author: gatorsmile 

Closes #20591 from gatorsmile/scalaUDF.

(cherry picked from commit 2ee76c22b6e48e643694c9475e5f0d37124215e7)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4f6a457d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4f6a457d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4f6a457d

Branch: refs/heads/branch-2.3
Commit: 4f6a457d464096d791e13e57c55bcf23c01c418f
Parents: 320ffb1
Author: gatorsmile 
Authored: Tue Feb 13 11:56:49 2018 -0800
Committer: Shixiong Zhu 
Committed: Tue Feb 13 11:56:57 2018 -0800

--
 .../apache/spark/sql/catalyst/expressions/ScalaUDF.scala | 11 +++
 1 file changed, 11 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4f6a457d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
index 388ef42..989c023 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
@@ -49,6 +49,17 @@ case class ScalaUDF(
 udfDeterministic: Boolean = true)
   extends Expression with ImplicitCastInputTypes with NonSQLExpression with 
UserDefinedExpression {
 
+  // The constructor for SPARK 2.1 and 2.2
+  def this(
+  function: AnyRef,
+  dataType: DataType,
+  children: Seq[Expression],
+  inputTypes: Seq[DataType],
+  udfName: Option[String]) = {
+this(
+  function, dataType, children, inputTypes, udfName, nullable = true, 
udfDeterministic = true)
+  }
+
   override lazy val deterministic: Boolean = udfDeterministic && 
children.forall(_.deterministic)
 
   override def toString: String =





spark git commit: [SPARK-23434][SQL] Spark should not warn `metadata directory` for a HDFS file path

2018-02-20 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 83c008762 -> 3e48f3b9e


[SPARK-23434][SQL] Spark should not warn `metadata directory` for a HDFS file 
path

## What changes were proposed in this pull request?

In a kerberized cluster, when Spark reads a file path (e.g. `people.json`), it logs a misleading warning while looking up `people.json/_spark_metadata`. The root cause is a behavioural difference between `LocalFileSystem` and `DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, while `DistributedFileSystem.exists` raises `org.apache.hadoop.security.AccessControlException`.

```scala
scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> 
spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
++---+
| age|   name|
++---+
|null|Michael|
|  30|   Andy|
|  19| Justin|
++---+

scala> spark.read.json("hdfs:///tmp/people.json")
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
```

After this PR,
```scala
scala> spark.read.json("hdfs:///tmp/people.json").show
++---+
| age|   name|
++---+
|null|Michael|
|  30|   Andy|
|  19| Justin|
++---+
```
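
The guarded lookup boils down to the following hedged sketch (the path and the bare Hadoop `Configuration` are illustrative): only a directory can contain a `_spark_metadata` child, so a plain file is answered without ever probing a child path, which is the call that triggers the `AccessControlException` and the spurious warning on HDFS.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object MetadataLookupSketch {
  def hasStreamingMetadata(pathString: String, hadoopConf: Configuration): Boolean = {
    val path = new Path(pathString)
    val fs = path.getFileSystem(hadoopConf)
    if (fs.isDirectory(path)) {
      // Only now is it meaningful to probe for the sink's metadata directory.
      fs.exists(new Path(path, "_spark_metadata"))
    } else {
      false // a plain file such as people.json can never hold the metadata dir
    }
  }

  def main(args: Array[String]): Unit = {
    println(hasStreamingMetadata("/tmp/people.json", new Configuration()))
  }
}
```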

## How was this patch tested?

Manual.

Author: Dongjoon Hyun 

Closes #20616 from dongjoon-hyun/SPARK-23434.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3e48f3b9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3e48f3b9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3e48f3b9

Branch: refs/heads/master
Commit: 3e48f3b9ee7645e4218ad3ff7559e578d4bd9667
Parents: 83c0087
Author: Dongjoon Hyun 
Authored: Tue Feb 20 16:02:44 2018 -0800
Committer: Shixiong Zhu 
Committed: Tue Feb 20 16:02:44 2018 -0800

--
 .../spark/sql/execution/streaming/FileStreamSink.scala   | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3e48f3b9/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
index 2715fa9..87a17ce 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
@@ -42,9 +42,11 @@ object FileStreamSink extends Logging {
 try {
   val hdfsPath = new Path(singlePath)
   val fs = hdfsPath.getFileSystem(hadoopConf)
-  val metadataPath = new Path(hdfsPath, metadataDir)
-  val res = fs.exists(metadataPath)
-  res
+  if (fs.isDirectory(hdfsPath)) {
+fs.exists(new Path(hdfsPath, metadataDir))
+  } else {
+false
+  }
 } catch {
   case NonFatal(e) =>
 logWarning(s"Error while looking for metadata directory.")





spark git commit: [SPARK-23481][WEBUI] lastStageAttempt should fail when a stage doesn't exist

2018-02-21 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 3fd0ccb13 -> 744d5af65


[SPARK-23481][WEBUI] lastStageAttempt should fail when a stage doesn't exist

## What changes were proposed in this pull request?

The issue here is that `AppStatusStore.lastStageAttempt` will return the next available stage in the store when the requested stage doesn't exist.

This PR adds `last(stageId)` to ensure it returns the correct `StageData`.
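
As a hedged analogy (plain Java collections, not the KVStore API that `AppStatusStore` actually uses), the bug and the fix look like this:

```scala
import java.util.TreeMap

// A descending scan that is only given a start key falls through to a
// neighbouring entry when that key is missing, while a scan bounded on both
// ends comes back empty.
object BoundedLookupSketch {
  def main(args: Array[String]): Unit = {
    val stages = new TreeMap[Integer, String]()
    stages.put(2, "stage2")
    stages.put(3, "stage3") // stage 4 has been evicted

    // Unbounded descending lookup: asking for the missing stage 4 silently
    // surfaces stage 3, which is what lastStageAttempt(4) used to do.
    println(stages.floorEntry(4)) // 3=stage3

    // Bounding the scan to the exact id -- the effect of first(id) plus
    // last(id) -- means a missing stage yields nothing.
    println(Option(stages.get(4))) // None
  }
}
```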

## How was this patch tested?

The new unit test.

Author: Shixiong Zhu 

Closes #20654 from zsxwing/SPARK-23481.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/744d5af6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/744d5af6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/744d5af6

Branch: refs/heads/master
Commit: 744d5af652ee8cece361cbca31e5201134e0fb42
Parents: 3fd0ccb
Author: Shixiong Zhu 
Authored: Wed Feb 21 15:37:28 2018 -0800
Committer: Shixiong Zhu 
Committed: Wed Feb 21 15:37:28 2018 -0800

--
 .../apache/spark/status/AppStatusStore.scala|  6 +++-
 .../spark/status/AppStatusListenerSuite.scala   | 33 
 2 files changed, 38 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/744d5af6/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala
--
diff --git a/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala 
b/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala
index efc2853..688f25a 100644
--- a/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala
+++ b/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala
@@ -95,7 +95,11 @@ private[spark] class AppStatusStore(
   }
 
   def lastStageAttempt(stageId: Int): v1.StageData = {
-val it = 
store.view(classOf[StageDataWrapper]).index("stageId").reverse().first(stageId)
+val it = store.view(classOf[StageDataWrapper])
+  .index("stageId")
+  .reverse()
+  .first(stageId)
+  .last(stageId)
   .closeableIterator()
 try {
   if (it.hasNext()) {

http://git-wip-us.apache.org/repos/asf/spark/blob/744d5af6/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala 
b/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala
index 7495027..673d191 100644
--- a/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala
@@ -1121,6 +1121,39 @@ class AppStatusListenerSuite extends SparkFunSuite with 
BeforeAndAfter {
 }
   }
 
+  test("lastStageAttempt should fail when the stage doesn't exist") {
+val testConf = conf.clone().set(MAX_RETAINED_STAGES, 1)
+val listener = new AppStatusListener(store, testConf, true)
+val appStore = new AppStatusStore(store)
+
+val stage1 = new StageInfo(1, 0, "stage1", 4, Nil, Nil, "details1")
+val stage2 = new StageInfo(2, 0, "stage2", 4, Nil, Nil, "details2")
+val stage3 = new StageInfo(3, 0, "stage3", 4, Nil, Nil, "details3")
+
+time += 1
+stage1.submissionTime = Some(time)
+listener.onStageSubmitted(SparkListenerStageSubmitted(stage1, new 
Properties()))
+stage1.completionTime = Some(time)
+listener.onStageCompleted(SparkListenerStageCompleted(stage1))
+
+// Make stage 3 complete before stage 2 so that stage 3 will be evicted
+time += 1
+stage3.submissionTime = Some(time)
+listener.onStageSubmitted(SparkListenerStageSubmitted(stage3, new 
Properties()))
+stage3.completionTime = Some(time)
+listener.onStageCompleted(SparkListenerStageCompleted(stage3))
+
+time += 1
+stage2.submissionTime = Some(time)
+listener.onStageSubmitted(SparkListenerStageSubmitted(stage2, new 
Properties()))
+stage2.completionTime = Some(time)
+listener.onStageCompleted(SparkListenerStageCompleted(stage2))
+
+assert(appStore.asOption(appStore.lastStageAttempt(1)) === None)
+assert(appStore.asOption(appStore.lastStageAttempt(2)).map(_.stageId) === 
Some(2))
+assert(appStore.asOption(appStore.lastStageAttempt(3)) === None)
+  }
+
   test("driver logs") {
 val listener = new AppStatusListener(store, conf, true)
 





spark git commit: [SPARK-23481][WEBUI] lastStageAttempt should fail when a stage doesn't exist

2018-02-21 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 373ac642f -> 23ba4416e


[SPARK-23481][WEBUI] lastStageAttempt should fail when a stage doesn't exist

## What changes were proposed in this pull request?

The issue here is that `AppStatusStore.lastStageAttempt` will return the next available stage in the store when the requested stage doesn't exist.

This PR adds `last(stageId)` to ensure it returns the correct `StageData`.

## How was this patch tested?

The new unit test.

Author: Shixiong Zhu 

Closes #20654 from zsxwing/SPARK-23481.

(cherry picked from commit 744d5af652ee8cece361cbca31e5201134e0fb42)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/23ba4416
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/23ba4416
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/23ba4416

Branch: refs/heads/branch-2.3
Commit: 23ba4416e1bbbaa818876d7a837f7a5e260aa048
Parents: 373ac64
Author: Shixiong Zhu 
Authored: Wed Feb 21 15:37:28 2018 -0800
Committer: Shixiong Zhu 
Committed: Wed Feb 21 15:37:36 2018 -0800

--
 .../apache/spark/status/AppStatusStore.scala|  6 +++-
 .../spark/status/AppStatusListenerSuite.scala   | 33 
 2 files changed, 38 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/23ba4416/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala
--
diff --git a/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala 
b/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala
index efc2853..688f25a 100644
--- a/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala
+++ b/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala
@@ -95,7 +95,11 @@ private[spark] class AppStatusStore(
   }
 
   def lastStageAttempt(stageId: Int): v1.StageData = {
-val it = 
store.view(classOf[StageDataWrapper]).index("stageId").reverse().first(stageId)
+val it = store.view(classOf[StageDataWrapper])
+  .index("stageId")
+  .reverse()
+  .first(stageId)
+  .last(stageId)
   .closeableIterator()
 try {
   if (it.hasNext()) {

http://git-wip-us.apache.org/repos/asf/spark/blob/23ba4416/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala 
b/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala
index 01d76a2..f3fa4c9 100644
--- a/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala
@@ -1057,6 +1057,39 @@ class AppStatusListenerSuite extends SparkFunSuite with 
BeforeAndAfter {
 }
   }
 
+  test("lastStageAttempt should fail when the stage doesn't exist") {
+val testConf = conf.clone().set(MAX_RETAINED_STAGES, 1)
+val listener = new AppStatusListener(store, testConf, true)
+val appStore = new AppStatusStore(store)
+
+val stage1 = new StageInfo(1, 0, "stage1", 4, Nil, Nil, "details1")
+val stage2 = new StageInfo(2, 0, "stage2", 4, Nil, Nil, "details2")
+val stage3 = new StageInfo(3, 0, "stage3", 4, Nil, Nil, "details3")
+
+time += 1
+stage1.submissionTime = Some(time)
+listener.onStageSubmitted(SparkListenerStageSubmitted(stage1, new 
Properties()))
+stage1.completionTime = Some(time)
+listener.onStageCompleted(SparkListenerStageCompleted(stage1))
+
+// Make stage 3 complete before stage 2 so that stage 3 will be evicted
+time += 1
+stage3.submissionTime = Some(time)
+listener.onStageSubmitted(SparkListenerStageSubmitted(stage3, new 
Properties()))
+stage3.completionTime = Some(time)
+listener.onStageCompleted(SparkListenerStageCompleted(stage3))
+
+time += 1
+stage2.submissionTime = Some(time)
+listener.onStageSubmitted(SparkListenerStageSubmitted(stage2, new 
Properties()))
+stage2.completionTime = Some(time)
+listener.onStageCompleted(SparkListenerStageCompleted(stage2))
+
+assert(appStore.asOption(appStore.lastStageAttempt(1)) === None)
+assert(appStore.asOption(appStore.lastStageAttempt(2)).map(_.stageId) === 
Some(2))
+assert(appStore.asOption(appStore.lastStageAttempt(3)) === None)
+  }
+
   test("driver logs") {
 val listener = new AppStatusListener(store, conf, true)
 





spark git commit: [SPARK-23533][SS] Add support for changing ContinuousDataReader's startOffset

2018-03-15 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 4f5bad615 -> 7c3e8995f


[SPARK-23533][SS] Add support for changing ContinuousDataReader's startOffset

## What changes were proposed in this pull request?

As discussed in #20675, we need to add a new interface `ContinuousDataReaderFactory` to support setting the start offset in Continuous Processing.
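
A hedged sketch of what a source-side implementation of the new mix-in could look like; `LongPartitionOffset` and `CountingReader` are invented for illustration, only the interface names come from this patch:

```scala
import org.apache.spark.sql.sources.v2.reader.{ContinuousDataReaderFactory, DataReader}
import org.apache.spark.sql.sources.v2.reader.streaming.PartitionOffset

case class LongPartitionOffset(value: Long) extends PartitionOffset

// Emits ten consecutive longs starting from `start`.
class CountingReader(start: Long) extends DataReader[Long] {
  private var current = start - 1
  override def next(): Boolean = { current += 1; current < start + 10 }
  override def get(): Long = current
  override def close(): Unit = ()
}

class CountingReaderFactory(planningStart: Long)
  extends ContinuousDataReaderFactory[Long] {

  // Original entry point: start from the offset captured at planning time.
  override def createDataReader(): DataReader[Long] =
    new CountingReader(planningStart)

  // New entry point added by this patch: restart the same partition from an
  // arbitrary offset, e.g. when an epoch is retried.
  override def createDataReaderWithOffset(offset: PartitionOffset): DataReader[Long] =
    new CountingReader(offset.asInstanceOf[LongPartitionOffset].value)
}
```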

## How was this patch tested?

Existing UT.

Author: Yuanjian Li 

Closes #20689 from xuanyuanking/SPARK-23533.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7c3e8995
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7c3e8995
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7c3e8995

Branch: refs/heads/master
Commit: 7c3e8995f18a1fb57c1f2c1b98a1d47590e28f38
Parents: 4f5bad6
Author: Yuanjian Li 
Authored: Thu Mar 15 00:04:28 2018 -0700
Committer: Shixiong Zhu 
Committed: Thu Mar 15 00:04:28 2018 -0700

--
 .../sql/kafka010/KafkaContinuousReader.scala| 11 +-
 .../v2/reader/ContinuousDataReaderFactory.java  | 35 
 .../continuous/ContinuousRateStreamSource.scala | 15 -
 3 files changed, 59 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7c3e8995/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
--
diff --git 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
index ecd1170..6e56b0a 100644
--- 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
+++ 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
@@ -164,7 +164,16 @@ case class KafkaContinuousDataReaderFactory(
 startOffset: Long,
 kafkaParams: ju.Map[String, Object],
 pollTimeoutMs: Long,
-failOnDataLoss: Boolean) extends DataReaderFactory[UnsafeRow] {
+failOnDataLoss: Boolean) extends ContinuousDataReaderFactory[UnsafeRow] {
+
+  override def createDataReaderWithOffset(offset: PartitionOffset): 
DataReader[UnsafeRow] = {
+val kafkaOffset = offset.asInstanceOf[KafkaSourcePartitionOffset]
+require(kafkaOffset.topicPartition == topicPartition,
+  s"Expected topicPartition: $topicPartition, but got: 
${kafkaOffset.topicPartition}")
+new KafkaContinuousDataReader(
+  topicPartition, kafkaOffset.partitionOffset, kafkaParams, pollTimeoutMs, 
failOnDataLoss)
+  }
+
   override def createDataReader(): KafkaContinuousDataReader = {
 new KafkaContinuousDataReader(
   topicPartition, startOffset, kafkaParams, pollTimeoutMs, failOnDataLoss)

http://git-wip-us.apache.org/repos/asf/spark/blob/7c3e8995/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ContinuousDataReaderFactory.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ContinuousDataReaderFactory.java
 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ContinuousDataReaderFactory.java
new file mode 100644
index 000..a616976
--- /dev/null
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ContinuousDataReaderFactory.java
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.sources.v2.reader.streaming.PartitionOffset;
+
+/**
+ * A mix-in interface for {@link DataReaderFactory}. Continuous data reader 
factories can
+ * implement this interface to provide creating {@link DataReader} with 
particular offset.
+ */
+@InterfaceStability.Evolving
+public interface ContinuousDataReaderFactory extends DataReaderFactory {
+  /**
+   * Create a DataReader with particular offset as its startOffset.
+   *
+   * @param

spark git commit: [SPARK-23623][SS] Avoid concurrent use of cached consumers in CachedKafkaConsumer

2018-03-16 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 9945b0227 -> bd201bf61


[SPARK-23623][SS] Avoid concurrent use of cached consumers in 
CachedKafkaConsumer

## What changes were proposed in this pull request?

CachedKafkaConsumer in the `kafka-0-10-sql` project is designed to maintain a pool of KafkaConsumers that can be reused. However, it was built with the assumption that there would be only one task trying to read the same Kafka TopicPartition at a time. Hence, the cache was keyed by the TopicPartition a consumer is supposed to read. For any case where this assumption may not hold, there is a SparkPlan flag to disable the use of the cache. So it was up to the planner to correctly identify when it was not safe to use the cache and to set the flag accordingly.

Fundamentally, this is the wrong way to approach the problem. It is HARD for a 
high-level planner to reason about the low-level execution model, whether there 
will be multiple tasks in the same query trying to read the same partition. 
Case in point, 2.3.0 introduced stream-stream joins, and you can build a 
streaming self-join query on Kafka. It's pretty non-trivial to figure out how 
this leads to two tasks reading the same partition twice, possibly 
concurrently. And due to the non-triviality, it is hard to figure this out in 
the planner and set the flag to avoid the cache / consumer pool. And this can 
inadvertently lead to ConcurrentModificationException or, worse, silent reading of incorrect data.

Here is a better way to design this. The planner shouldn't have to understand these low-level optimizations. Rather, the consumer pool should be smart enough to avoid concurrent use of a cached consumer. Currently, it tries to do so but incorrectly (the flag `inuse` is not checked when returning a cached consumer, see [this](https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala#L403)). If there is another request for the same partition as a currently in-use consumer, the pool should automatically return a fresh consumer that is closed when the task is done. Then the planner does not need a flag to avoid reuse.

This PR is a step towards that goal. It does the following.
- There are effectively two kinds of consumer that may be generated
  - Cached consumer - this should be returned to the pool at task end
  - Non-cached consumer - this should be closed at task end
- A trait called KafkaConsumer is introduced to hide this difference from the users of the consumer, so that the client code does not have to reason about whether to stop and release. They simply call `val consumer = KafkaConsumer.acquire` and then `consumer.release()`.
- If there is a request for a consumer that is in use, then a new consumer is generated.
- If there is a concurrent attempt of the same task, then a new consumer is 
generated, and the existing cached consumer is marked for close upon release.
- In addition, I renamed the classes because CachedKafkaConsumer is a misnomer 
given that what it returns may or may not be cached.

This PR does not remove the planner flag that avoids reuse, in order to keep this patch safe enough for merging into branch-2.3. That can be done later, in master only.
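
A toy, hedged sketch of the acquire/release policy described above (a plain string key stands in for a TopicPartition and `Client` for a raw Kafka consumer; none of these names come from the patch):

```scala
import scala.collection.mutable

object ConsumerPoolSketch {

  class Client(val key: String) {
    var inUse = false
    // Set when a concurrent attempt of the same task replaces this consumer,
    // so it is closed instead of being returned to the pool.
    var markedForClose = false
    def close(): Unit = println(s"closing consumer for $key")
  }

  private val cache = mutable.HashMap.empty[String, Client]

  // Returns (client, cached): a cached client when it is free, otherwise a
  // fresh one that the caller must close on release.
  def acquire(key: String): (Client, Boolean) = synchronized {
    cache.get(key) match {
      case Some(c) if !c.inUse =>
        c.inUse = true
        (c, true)
      case Some(_) =>
        // Cached consumer is busy (e.g. a self-join reading the same
        // partition twice): hand out a throwaway consumer instead.
        (new Client(key), false)
      case None =>
        val c = new Client(key)
        c.inUse = true
        cache(key) = c
        (c, true)
    }
  }

  def release(client: Client, cached: Boolean): Unit = synchronized {
    if (!cached || client.markedForClose) client.close()
    else client.inUse = false
  }

  def main(args: Array[String]): Unit = {
    val (c1, cached1) = acquire("topic-0")
    val (c2, cached2) = acquire("topic-0") // concurrent use: gets a fresh client
    println(s"second acquire got a cached client: $cached2")
    release(c2, cached2) // non-cached client is closed
    release(c1, cached1) // cached client goes back to the pool
  }
}
```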

## How was this patch tested?
A new stress test that verifies it is safe to concurrently get consumers for 
the same partition from the consumer pool.

Author: Tathagata Das 

Closes #20767 from tdas/SPARK-23623.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bd201bf6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bd201bf6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bd201bf6

Branch: refs/heads/master
Commit: bd201bf61e8e1713deb91b962f670c76c9e3492b
Parents: 9945b02
Author: Tathagata Das 
Authored: Fri Mar 16 11:11:07 2018 -0700
Committer: Shixiong Zhu 
Committed: Fri Mar 16 11:11:07 2018 -0700

--
 .../sql/kafka010/CachedKafkaConsumer.scala  | 438 
 .../sql/kafka010/KafkaContinuousReader.scala|   5 +-
 .../spark/sql/kafka010/KafkaDataConsumer.scala  | 516 +++
 .../sql/kafka010/KafkaMicroBatchReader.scala|  22 +-
 .../spark/sql/kafka010/KafkaSourceRDD.scala |  23 +-
 .../sql/kafka010/CachedKafkaConsumerSuite.scala |  34 --
 .../sql/kafka010/KafkaDataConsumerSuite.scala   | 124 +
 7 files changed, 651 insertions(+), 511 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bd201bf6/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
--
diff --git 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
 
b/external/kafka-0-10-sql/src/m

spark git commit: [SPARK-23623][SS] Avoid concurrent use of cached consumers in CachedKafkaConsumer (branch-2.3)

2018-03-17 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 21b6de459 -> 6937571ab


[SPARK-23623][SS] Avoid concurrent use of cached consumers in 
CachedKafkaConsumer (branch-2.3)

This is a backport of #20767 to branch 2.3

## What changes were proposed in this pull request?
CachedKafkaConsumer in the `kafka-0-10-sql` project is designed to maintain a pool of KafkaConsumers that can be reused. However, it was built with the assumption that there would be only one task trying to read the same Kafka TopicPartition at a time. Hence, the cache was keyed by the TopicPartition a consumer is supposed to read. For any case where this assumption may not hold, there is a SparkPlan flag to disable the use of the cache. So it was up to the planner to correctly identify when it was not safe to use the cache and to set the flag accordingly.

Fundamentally, this is the wrong way to approach the problem. It is HARD for a 
high-level planner to reason about the low-level execution model, whether there 
will be multiple tasks in the same query trying to read the same partition. 
Case in point, 2.3.0 introduced stream-stream joins, and you can build a 
streaming self-join query on Kafka. It's pretty non-trivial to figure out how 
this leads to two tasks reading the same partition twice, possibly 
concurrently. And due to the non-triviality, it is hard to figure this out in 
the planner and set the flag to avoid the cache / consumer pool. And this can 
inadvertently lead to ConcurrentModificationException or, worse, silent reading of incorrect data.

Here is a better way to design this. The planner shouldn't have to understand these low-level optimizations. Rather, the consumer pool should be smart enough to avoid concurrent use of a cached consumer. Currently, it tries to do so but incorrectly (the flag `inuse` is not checked when returning a cached consumer, see [this](https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala#L403)). If there is another request for the same partition as a currently in-use consumer, the pool should automatically return a fresh consumer that is closed when the task is done. Then the planner does not need a flag to avoid reuse.

This PR is a step towards that goal. It does the following.
- There are effectively two kinds of consumer that may be generated
  - Cached consumer - this should be returned to the pool at task end
  - Non-cached consumer - this should be closed at task end
- A trait called KafkaConsumer is introduced to hide this difference from the users of the consumer, so that the client code does not have to reason about whether to stop and release. They simply call `val consumer = KafkaConsumer.acquire` and then `consumer.release()`.
- If there is a request for a consumer that is in use, then a new consumer is generated.
- If there is a concurrent attempt of the same task, then a new consumer is 
generated, and the existing cached consumer is marked for close upon release.
- In addition, I renamed the classes because CachedKafkaConsumer is a misnomer 
given that what it returns may or may not be cached.

This PR does not remove the planner flag that avoids reuse, in order to keep this patch safe enough for merging into branch-2.3. That can be done later, in master only.

## How was this patch tested?
A new stress test that verifies it is safe to concurrently get consumers for 
the same partition from the consumer pool.

Author: Tathagata Das 

Closes #20848 from tdas/SPARK-23623-2.3.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6937571a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6937571a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6937571a

Branch: refs/heads/branch-2.3
Commit: 6937571ab8818a62ec2457a373eb3f6f618985e1
Parents: 21b6de4
Author: Tathagata Das 
Authored: Sat Mar 17 16:24:51 2018 -0700
Committer: Shixiong Zhu 
Committed: Sat Mar 17 16:24:51 2018 -0700

--
 .../sql/kafka010/CachedKafkaConsumer.scala  | 438 
 .../sql/kafka010/KafkaContinuousReader.scala|   4 +-
 .../spark/sql/kafka010/KafkaDataConsumer.scala  | 516 +++
 .../spark/sql/kafka010/KafkaSourceRDD.scala |  23 +-
 .../sql/kafka010/CachedKafkaConsumerSuite.scala |  34 --
 .../sql/kafka010/KafkaDataConsumerSuite.scala   | 124 +
 6 files changed, 648 insertions(+), 491 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6937571a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
--
diff --git 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
 
b/external/kafka-0-10

spark git commit: [SPARK-23788][SS] Fix race in StreamingQuerySuite

2018-03-24 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master a33655348 -> 816a5496b


[SPARK-23788][SS] Fix race in StreamingQuerySuite

## What changes were proposed in this pull request?

The serializability test uses the same MemoryStream instance for 3 different 
queries. If any of those queries ask it to commit before the others have run, 
the rest will see empty dataframes. This can fail the test if q3 is affected.

We should use one instance per query instead.

## How was this patch tested?

Existing unit test. If I move q2.processAllAvailable() before starting q3, the 
test always fails without the fix.

Author: Jose Torres 

Closes #20896 from jose-torres/fixrace.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/816a5496
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/816a5496
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/816a5496

Branch: refs/heads/master
Commit: 816a5496ba4caac438f70400f72bb10bfcc02418
Parents: a336553
Author: Jose Torres 
Authored: Sat Mar 24 18:21:01 2018 -0700
Committer: Shixiong Zhu 
Committed: Sat Mar 24 18:21:01 2018 -0700

--
 .../apache/spark/sql/streaming/StreamingQuerySuite.scala  | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/816a5496/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
index ebc9a87..08749b4 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
@@ -550,22 +550,22 @@ class StreamingQuerySuite extends StreamTest with 
BeforeAndAfter with Logging wi
 .start()
 }
 
-val input = MemoryStream[Int]
-val q1 = startQuery(input.toDS, "stream_serializable_test_1")
-val q2 = startQuery(input.toDS.map { i =>
+val input = MemoryStream[Int] :: MemoryStream[Int] :: MemoryStream[Int] :: 
Nil
+val q1 = startQuery(input(0).toDS, "stream_serializable_test_1")
+val q2 = startQuery(input(1).toDS.map { i =>
   // Emulate that `StreamingQuery` get captured with normal usage 
unintentionally.
   // It should not fail the query.
   q1
   i
 }, "stream_serializable_test_2")
-val q3 = startQuery(input.toDS.map { i =>
+val q3 = startQuery(input(2).toDS.map { i =>
   // Emulate that `StreamingQuery` is used in executors. We should fail 
the query with a clear
   // error message.
   q1.explain()
   i
 }, "stream_serializable_test_3")
 try {
-  input.addData(1)
+  input.foreach(_.addData(1))
 
   // q2 should not fail since it doesn't use `q1` in the closure
   q2.processAllAvailable()





spark git commit: [SPARK-23788][SS] Fix race in StreamingQuerySuite

2018-03-24 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 ea44783ad -> 523fcafc5


[SPARK-23788][SS] Fix race in StreamingQuerySuite

## What changes were proposed in this pull request?

The serializability test uses the same MemoryStream instance for 3 different 
queries. If any of those queries ask it to commit before the others have run, 
the rest will see empty dataframes. This can fail the test if q3 is affected.

We should use one instance per query instead.

## How was this patch tested?

Existing unit test. If I move q2.processAllAvailable() before starting q3, the 
test always fails without the fix.

Author: Jose Torres 

Closes #20896 from jose-torres/fixrace.

(cherry picked from commit 816a5496ba4caac438f70400f72bb10bfcc02418)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/523fcafc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/523fcafc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/523fcafc

Branch: refs/heads/branch-2.3
Commit: 523fcafc5c4a79cf3455f3ceab6d886679399495
Parents: ea44783
Author: Jose Torres 
Authored: Sat Mar 24 18:21:01 2018 -0700
Committer: Shixiong Zhu 
Committed: Sat Mar 24 18:21:14 2018 -0700

--
 .../apache/spark/sql/streaming/StreamingQuerySuite.scala  | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/523fcafc/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
index 76201c6..2b0ab33 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
@@ -532,22 +532,22 @@ class StreamingQuerySuite extends StreamTest with 
BeforeAndAfter with Logging wi
 .start()
 }
 
-val input = MemoryStream[Int]
-val q1 = startQuery(input.toDS, "stream_serializable_test_1")
-val q2 = startQuery(input.toDS.map { i =>
+val input = MemoryStream[Int] :: MemoryStream[Int] :: MemoryStream[Int] :: 
Nil
+val q1 = startQuery(input(0).toDS, "stream_serializable_test_1")
+val q2 = startQuery(input(1).toDS.map { i =>
   // Emulate that `StreamingQuery` get captured with normal usage 
unintentionally.
   // It should not fail the query.
   q1
   i
 }, "stream_serializable_test_2")
-val q3 = startQuery(input.toDS.map { i =>
+val q3 = startQuery(input(2).toDS.map { i =>
   // Emulate that `StreamingQuery` is used in executors. We should fail 
the query with a clear
   // error message.
   q1.explain()
   i
 }, "stream_serializable_test_3")
 try {
-  input.addData(1)
+  input.foreach(_.addData(1))
 
   // q2 should not fail since it doesn't use `q1` in the closure
   q2.processAllAvailable()





spark git commit: [SPARK-23788][SS] Fix race in StreamingQuerySuite

2018-03-24 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 85ab72b59 -> 6b5f9c374


[SPARK-23788][SS] Fix race in StreamingQuerySuite

## What changes were proposed in this pull request?

The serializability test uses the same MemoryStream instance for 3 different 
queries. If any of those queries ask it to commit before the others have run, 
the rest will see empty dataframes. This can fail the test if q3 is affected.

We should use one instance per query instead.

## How was this patch tested?

Existing unit test. If I move q2.processAllAvailable() before starting q3, the 
test always fails without the fix.

Author: Jose Torres 

Closes #20896 from jose-torres/fixrace.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6b5f9c37
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6b5f9c37
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6b5f9c37

Branch: refs/heads/branch-2.2
Commit: 6b5f9c3745a1005519261fc80825a99377906451
Parents: 85ab72b
Author: Jose Torres 
Authored: Sat Mar 24 18:21:01 2018 -0700
Committer: Shixiong Zhu 
Committed: Sat Mar 24 18:22:15 2018 -0700

--
 .../apache/spark/sql/streaming/StreamingQuerySuite.scala  | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6b5f9c37/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
index 01c34b1..9e65aa8 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
@@ -533,22 +533,22 @@ class StreamingQuerySuite extends StreamTest with 
BeforeAndAfter with Logging wi
 .start()
 }
 
-val input = MemoryStream[Int]
-val q1 = startQuery(input.toDS, "stream_serializable_test_1")
-val q2 = startQuery(input.toDS.map { i =>
+val input = MemoryStream[Int] :: MemoryStream[Int] :: MemoryStream[Int] :: 
Nil
+val q1 = startQuery(input(0).toDS, "stream_serializable_test_1")
+val q2 = startQuery(input(1).toDS.map { i =>
   // Emulate that `StreamingQuery` get captured with normal usage 
unintentionally.
   // It should not fail the query.
   q1
   i
 }, "stream_serializable_test_2")
-val q3 = startQuery(input.toDS.map { i =>
+val q3 = startQuery(input(2).toDS.map { i =>
   // Emulate that `StreamingQuery` is used in executors. We should fail 
the query with a clear
   // error message.
   q1.explain()
   i
 }, "stream_serializable_test_3")
 try {
-  input.addData(1)
+  input.foreach(_.addData(1))
 
   // q2 should not fail since it doesn't use `q1` in the closure
   q2.processAllAvailable()





spark git commit: [SPARK-19137][SQL] Fix `withSQLConf` to reset `OptionalConfigEntry` correctly

2017-01-10 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 3ef183a94 -> d5b1dc934


[SPARK-19137][SQL] Fix `withSQLConf` to reset `OptionalConfigEntry` correctly

## What changes were proposed in this pull request?

`DataStreamReaderWriterSuite` creates test files in the source folder like the following. Interestingly, the root cause is that `withSQLConf` fails to reset `OptionalConfigEntry` correctly. In other words, it resets the config to `Some(<undefined>)`.

```bash
$ git status
Untracked files:
  (use "git add ..." to include in what will be committed)

sql/core/%253Cundefined%253E/
sql/core/%3Cundefined%3E/
```
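
A hedged sketch of the capture-and-restore pattern after the fix, assuming a local session and an illustrative config key: a key is captured only when it is actually set, so teardown never writes a placeholder such as `<undefined>` back into the conf.

```scala
import org.apache.spark.sql.SparkSession

object WithConfExample {
  def withConf(spark: SparkSession)(pairs: (String, String)*)(f: => Unit): Unit = {
    val (keys, values) = pairs.unzip
    // Capture a previous value only if the key is actually set.
    val previous = keys.map { key =>
      if (spark.conf.contains(key)) Some(spark.conf.get(key)) else None
    }
    keys.zip(values).foreach { case (k, v) => spark.conf.set(k, v) }
    try f finally {
      keys.zip(previous).foreach {
        case (key, Some(value)) => spark.conf.set(key, value)
        case (key, None)        => spark.conf.unset(key) // never writes a placeholder
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("with-conf").getOrCreate()
    withConf(spark)("spark.sql.shuffle.partitions" -> "4") {
      println(spark.conf.get("spark.sql.shuffle.partitions")) // 4 inside the block
    }
    println(spark.conf.get("spark.sql.shuffle.partitions"))   // back to the session default
    spark.stop()
  }
}
```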

## How was this patch tested?

Manual.
```
build/sbt "project sql" test
git status
```

Author: Dongjoon Hyun 

Closes #16522 from dongjoon-hyun/SPARK-19137.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d5b1dc93
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d5b1dc93
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d5b1dc93

Branch: refs/heads/master
Commit: d5b1dc934a2482886c2c095de90e4c6a49ec42bd
Parents: 3ef183a
Author: Dongjoon Hyun 
Authored: Tue Jan 10 10:49:44 2017 -0800
Committer: Shixiong Zhu 
Committed: Tue Jan 10 10:49:44 2017 -0800

--
 .../test/scala/org/apache/spark/sql/test/SQLTestUtils.scala  | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d5b1dc93/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
index d4d8e3e..d4afb9d 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
@@ -94,7 +94,13 @@ private[sql] trait SQLTestUtils
*/
   protected def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
 val (keys, values) = pairs.unzip
-val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+val currentValues = keys.map { key =>
+  if (spark.conf.contains(key)) {
+Some(spark.conf.get(key))
+  } else {
+None
+  }
+}
 (keys, values).zipped.foreach(spark.conf.set)
 try f finally {
   keys.zip(currentValues).foreach {





spark git commit: [SPARK-19137][SQL] Fix `withSQLConf` to reset `OptionalConfigEntry` correctly

2017-01-10 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 65c866ef9 -> 69d1c4c5c


[SPARK-19137][SQL] Fix `withSQLConf` to reset `OptionalConfigEntry` correctly

## What changes were proposed in this pull request?

`DataStreamReaderWriterSuite` creates test files in the source folder like the following. Interestingly, the root cause is that `withSQLConf` fails to reset `OptionalConfigEntry` correctly. In other words, it resets the config to `Some(<undefined>)`.

```bash
$ git status
Untracked files:
  (use "git add <file>..." to include in what will be committed)

sql/core/%253Cundefined%253E/
sql/core/%3Cundefined%3E/
```

## How was this patch tested?

Manual.
```
build/sbt "project sql" test
git status
```

Author: Dongjoon Hyun 

Closes #16522 from dongjoon-hyun/SPARK-19137.

(cherry picked from commit d5b1dc934a2482886c2c095de90e4c6a49ec42bd)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/69d1c4c5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/69d1c4c5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/69d1c4c5

Branch: refs/heads/branch-2.1
Commit: 69d1c4c5c9510ccf05a0f05592201d5b756425f9
Parents: 65c866e
Author: Dongjoon Hyun 
Authored: Tue Jan 10 10:49:44 2017 -0800
Committer: Shixiong Zhu 
Committed: Tue Jan 10 10:49:54 2017 -0800

--
 .../test/scala/org/apache/spark/sql/test/SQLTestUtils.scala  | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/69d1c4c5/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
index d4d8e3e..d4afb9d 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
@@ -94,7 +94,13 @@ private[sql] trait SQLTestUtils
*/
   protected def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
 val (keys, values) = pairs.unzip
-val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+val currentValues = keys.map { key =>
+  if (spark.conf.contains(key)) {
+Some(spark.conf.get(key))
+  } else {
+None
+  }
+}
 (keys, values).zipped.foreach(spark.conf.set)
 try f finally {
   keys.zip(currentValues).foreach {





spark git commit: [SPARK-19113][SS][TESTS] Set UncaughtExceptionHandler in onQueryStarted to ensure catching fatal errors during query initialization

2017-01-10 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 69d1c4c5c -> e0af4b726


[SPARK-19113][SS][TESTS] Set UncaughtExceptionHandler in onQueryStarted to 
ensure catching fatal errors during query initialization

## What changes were proposed in this pull request?

StreamTest currently sets the `UncaughtExceptionHandler` after starting the query,
so it may not be able to catch fatal errors thrown during query initialization.
This PR uses the `onQueryStarted` callback to fix it.
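
As a minimal illustration of the idea (plain JVM threads, no Spark APIs; `HandlerSketch` is an illustrative name): installing the handler from inside the worker thread itself guarantees it is in place before any of the thread's work runs, which is what registering it in `onQueryStarted` achieves for the stream thread.

```scala
object HandlerSketch {
  @volatile private var caught: Throwable = _

  def main(args: Array[String]): Unit = {
    val worker = new Thread(new Runnable {
      override def run(): Unit = {
        // The handler is registered by the thread itself, before any "real" work starts.
        Thread.currentThread.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
          override def uncaughtException(t: Thread, e: Throwable): Unit = { caught = e }
        })
        // A fatal error thrown while the thread is still "initializing" is still caught.
        throw new OutOfMemoryError("boom during query initialization")
      }
    })
    worker.start()
    worker.join()
    assert(caught.isInstanceOf[OutOfMemoryError], "the handler should have seen the fatal error")
  }
}
```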

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #16492 from zsxwing/SPARK-19113.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e0af4b72
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e0af4b72
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e0af4b72

Branch: refs/heads/branch-2.1
Commit: e0af4b7263a49419fefc36a6dedf2183c1157912
Parents: 69d1c4c
Author: Shixiong Zhu 
Authored: Tue Jan 10 14:24:45 2017 +
Committer: Shixiong Zhu 
Committed: Tue Jan 10 10:51:20 2017 -0800

--
 .../spark/sql/streaming/StreamSuite.scala   |  7 +++--
 .../apache/spark/sql/streaming/StreamTest.scala | 28 +++-
 2 files changed, 26 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e0af4b72/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
index 34b0ee8..e964e64 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
@@ -238,7 +238,7 @@ class StreamSuite extends StreamTest {
 }
   }
 
-  testQuietly("fatal errors from a source should be sent to the user") {
+  testQuietly("handle fatal errors thrown from the stream thread") {
 for (e <- Seq(
   new VirtualMachineError {},
   new ThreadDeath,
@@ -259,8 +259,11 @@ class StreamSuite extends StreamTest {
 override def stop(): Unit = {}
   }
   val df = Dataset[Int](sqlContext.sparkSession, 
StreamingExecutionRelation(source))
-  // These error are fatal errors and should be ignored in `testStream` to 
not fail the test.
   testStream(df)(
+// `ExpectFailure(isFatalError = true)` verifies two things:
+// - Fatal errors can be propagated to `StreamingQuery.exception` and
+//   `StreamingQuery.awaitTermination` like non fatal errors.
+// - Fatal errors can be caught by UncaughtExceptionHandler.
 ExpectFailure(isFatalError = true)(ClassTag(e.getClass))
   )
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/e0af4b72/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
index 709050d..4aa4100 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
@@ -235,7 +235,10 @@ trait StreamTest extends QueryTest with SharedSQLContext 
with Timeouts {
*/
   def testStream(
   _stream: Dataset[_],
-  outputMode: OutputMode = OutputMode.Append)(actions: StreamAction*): 
Unit = {
+  outputMode: OutputMode = OutputMode.Append)(actions: StreamAction*): 
Unit = synchronized {
+// `synchronized` is added to prevent the user from calling multiple 
`testStream`s concurrently
+// because this method assumes there is only one active query in its 
`StreamingQueryListener`
+// and it may not work correctly when multiple `testStream`s run 
concurrently.
 
 val stream = _stream.toDF()
 val sparkSession = stream.sparkSession  // use the session in DF, not the 
default session
@@ -248,6 +251,22 @@ trait StreamTest extends QueryTest with SharedSQLContext 
with Timeouts {
 
 @volatile
 var streamThreadDeathCause: Throwable = null
+// Set UncaughtExceptionHandler in `onQueryStarted` so that we can ensure 
catching fatal errors
+// during query initialization.
+val listener = new StreamingQueryListener {
+  override def onQueryStarted(event: QueryStartedEvent): Unit = {
+// Note: this assumes there is only one query active in the 
`testStream` method.
+Thread.currentThread.setUncaughtExceptionHandler(new 
UncaughtExceptionHandler {
+  override def uncaughtException(t: Thread, e: Throwable): Unit = {
+

spark git commit: [SPARK-19140][SS] Allow update mode for non-aggregation streaming queries

2017-01-10 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 856bae6af -> bc6c56e94


[SPARK-19140][SS] Allow update mode for non-aggregation streaming queries

## What changes were proposed in this pull request?

This PR allows update mode for non-aggregation streaming queries. It will be
the same as append mode if a query has no aggregations.
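
A minimal usage sketch, assuming Spark 2.1.1+ on the classpath and a socket server on localhost:9999 (for example `nc -lk 9999`); the host, port and query below are illustrative only. Because the query has no aggregation, `update` emits exactly the rows that arrived since the last trigger, just like `append`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("update-mode-sketch").master("local[2]").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

// No aggregation, so "update" behaves like "append": only new rows are written per trigger.
val query = lines.as[String]
  .filter(_.nonEmpty)
  .writeStream
  .outputMode("update")
  .format("console")
  .start()

query.awaitTermination()
```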

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #16520 from zsxwing/update-without-agg.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bc6c56e9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bc6c56e9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bc6c56e9

Branch: refs/heads/master
Commit: bc6c56e940fe93591a1e5ba45751f1b243b57e28
Parents: 856bae6
Author: Shixiong Zhu 
Authored: Tue Jan 10 17:58:11 2017 -0800
Committer: Shixiong Zhu 
Committed: Tue Jan 10 17:58:11 2017 -0800

--
 docs/structured-streaming-programming-guide.md  |  4 +--
 python/pyspark/sql/streaming.py | 27 ++-
 .../apache/spark/sql/streaming/OutputMode.java  |  3 +-
 .../analysis/UnsupportedOperationChecker.scala  |  2 +-
 .../streaming/InternalOutputModes.scala |  4 +--
 .../analysis/UnsupportedOperationsSuite.scala   | 31 +
 .../spark/sql/streaming/DataStreamWriter.scala  | 18 --
 .../execution/streaming/MemorySinkSuite.scala   | 35 
 8 files changed, 72 insertions(+), 52 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bc6c56e9/docs/structured-streaming-programming-guide.md
--
diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index 52dbbc8..b816072 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -374,7 +374,7 @@ The "Output" is defined as what gets written out to the 
external storage. The ou
 
   - *Append Mode* - Only the new rows appended in the Result Table since the 
last trigger will be written to the external storage. This is applicable only 
on the queries where existing rows in the Result Table are not expected to 
change.
   
-  - *Update Mode* - Only the rows that were updated in the Result Table since 
the last trigger will be written to the external storage (available since Spark 
2.1.1). Note that this is different from the Complete Mode in that this mode 
only outputs the rows that have changed since the last trigger.
+  - *Update Mode* - Only the rows that were updated in the Result Table since 
the last trigger will be written to the external storage (available since Spark 
2.1.1). Note that this is different from the Complete Mode in that this mode 
only outputs the rows that have changed since the last trigger. If the query 
doesn't contain aggregations, it will be equivalent to Append mode.
 
 Note that each mode is applicable on certain types of queries. This is 
discussed in detail [later](#output-modes).
 
@@ -977,7 +977,7 @@ Here is the compatibility matrix.
   
   
 Queries without 
aggregation
-Append
+Append, Update
 
 Complete mode not supported as it is infeasible to keep all data in 
the Result Table.
 

http://git-wip-us.apache.org/repos/asf/spark/blob/bc6c56e9/python/pyspark/sql/streaming.py
--
diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py
index 5014299..a10b185 100644
--- a/python/pyspark/sql/streaming.py
+++ b/python/pyspark/sql/streaming.py
@@ -665,6 +665,9 @@ class DataStreamWriter(object):
the sink
 * `complete`:All the rows in the streaming DataFrame/Dataset will be 
written to the sink
every time these is some updates
+* `update`:only the rows that were updated in the streaming 
DataFrame/Dataset will be
+   written to the sink every time there are some updates. If the query 
doesn't contain
+   aggregations, it will be equivalent to `append` mode.
 
.. note:: Experimental.
 
@@ -768,7 +771,8 @@ class DataStreamWriter(object):
 
 @ignore_unicode_prefix
 @since(2.0)
-def start(self, path=None, format=None, partitionBy=None, queryName=None, 
**options):
+def start(self, path=None, format=None, outputMode=None, partitionBy=None, 
queryName=None,
+  **options):
 """Streams the contents of the :class:`DataFrame` to a data source.
 
 The data source is specified by the ``format`` and a set of 
``options``.
@@ -779,15 +783,20 @@ class DataStreamWriter(object):
 
 :param path: the path in a Hadoop supported file system
 :param format: the format 

spark git commit: [SPARK-19140][SS] Allow update mode for non-aggregation streaming queries

2017-01-10 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 81c943090 -> 230607d62


[SPARK-19140][SS] Allow update mode for non-aggregation streaming queries

## What changes were proposed in this pull request?

This PR allows update mode for non-aggregation streaming queries. It will be
the same as append mode if a query has no aggregations.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #16520 from zsxwing/update-without-agg.

(cherry picked from commit bc6c56e940fe93591a1e5ba45751f1b243b57e28)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/230607d6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/230607d6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/230607d6

Branch: refs/heads/branch-2.1
Commit: 230607d62493c36b214c01a70aa9b0dbb3a9ad4d
Parents: 81c9430
Author: Shixiong Zhu 
Authored: Tue Jan 10 17:58:11 2017 -0800
Committer: Shixiong Zhu 
Committed: Tue Jan 10 17:58:23 2017 -0800

--
 docs/structured-streaming-programming-guide.md  |  4 +--
 python/pyspark/sql/streaming.py | 27 ++-
 .../apache/spark/sql/streaming/OutputMode.java  |  3 +-
 .../analysis/UnsupportedOperationChecker.scala  |  2 +-
 .../streaming/InternalOutputModes.scala |  4 +--
 .../analysis/UnsupportedOperationsSuite.scala   | 31 +
 .../spark/sql/streaming/DataStreamWriter.scala  | 18 --
 .../execution/streaming/MemorySinkSuite.scala   | 35 
 8 files changed, 72 insertions(+), 52 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/230607d6/docs/structured-streaming-programming-guide.md
--
diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index 473a196..45ee551 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -374,7 +374,7 @@ The "Output" is defined as what gets written out to the 
external storage. The ou
 
   - *Append Mode* - Only the new rows appended in the Result Table since the 
last trigger will be written to the external storage. This is applicable only 
on the queries where existing rows in the Result Table are not expected to 
change.
   
-  - *Update Mode* - Only the rows that were updated in the Result Table since 
the last trigger will be written to the external storage (available since Spark 
2.1.1). Note that this is different from the Complete Mode in that this mode 
only outputs the rows that have changed since the last trigger.
+  - *Update Mode* - Only the rows that were updated in the Result Table since 
the last trigger will be written to the external storage (available since Spark 
2.1.1). Note that this is different from the Complete Mode in that this mode 
only outputs the rows that have changed since the last trigger. If the query 
doesn't contain aggregations, it will be equivalent to Append mode.
 
 Note that each mode is applicable on certain types of queries. This is 
discussed in detail [later](#output-modes).
 
@@ -977,7 +977,7 @@ Here is the compatibility matrix.
   
   
 Queries without 
aggregation
-Append
+Append, Update
 
 Complete mode not supported as it is infeasible to keep all data in 
the Result Table.
 

http://git-wip-us.apache.org/repos/asf/spark/blob/230607d6/python/pyspark/sql/streaming.py
--
diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py
index 5014299..a10b185 100644
--- a/python/pyspark/sql/streaming.py
+++ b/python/pyspark/sql/streaming.py
@@ -665,6 +665,9 @@ class DataStreamWriter(object):
the sink
 * `complete`:All the rows in the streaming DataFrame/Dataset will be 
written to the sink
every time these is some updates
+* `update`:only the rows that were updated in the streaming 
DataFrame/Dataset will be
+   written to the sink every time there are some updates. If the query 
doesn't contain
+   aggregations, it will be equivalent to `append` mode.
 
.. note:: Experimental.
 
@@ -768,7 +771,8 @@ class DataStreamWriter(object):
 
 @ignore_unicode_prefix
 @since(2.0)
-def start(self, path=None, format=None, partitionBy=None, queryName=None, 
**options):
+def start(self, path=None, format=None, outputMode=None, partitionBy=None, 
queryName=None,
+  **options):
 """Streams the contents of the :class:`DataFrame` to a data source.
 
 The data source is specified by the ``format`` and a set of 
``options``.
@@ -779,15 +783,20 @@ class DataStreamWriter(object):

spark git commit: [SPARK-18905][STREAMING] Fix the issue of removing a failed jobset from JobScheduler.jobSets

2017-01-16 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master c84f7d3e1 -> f8db8945f


[SPARK-18905][STREAMING] Fix the issue of removing a failed jobset from 
JobScheduler.jobSets

## What changes were proposed in this pull request?

The current implementation of Spark Streaming considers a batch completed
regardless of the results of its jobs
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203).
Consider the following case: a micro batch contains two jobs that read from two
different Kafka topics. One of these jobs fails due to a problem in the
user-defined logic after the other one has finished successfully.
1. The main thread in the Spark Streaming application executes the line
mentioned above,
2. another thread (the checkpoint writer) writes a checkpoint file immediately
after that line is executed,
3. then, due to the current error-handling mechanism in Spark Streaming, the
StreamingContext is closed
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214).
When the user recovers from the checkpoint file, the JobSet containing the
failed job was already removed (taken as completed) before the checkpoint was
constructed, so the data being processed by the failed job is never
reprocessed.

This PR fixes it by removing a JobSet from JobScheduler.jobSets only when all
jobs in the JobSet have finished successfully.
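
A simplified, self-contained sketch of the intended ordering, using stand-in `Job`/`JobSet` types rather than the real JobScheduler API: the batch is dropped from the tracked set only once every job in it has succeeded, so a checkpoint written in between still contains the failed batch and recovery can re-run it.

```scala
import scala.collection.concurrent.TrieMap
import scala.util.{Failure, Success, Try}

final case class Job(id: String, result: Try[Unit])
final case class JobSet(time: Long, jobs: Seq[Job]) {
  def allSucceeded: Boolean = jobs.forall(_.result.isSuccess)
}

val jobSets = TrieMap.empty[Long, JobSet]

def handleJobCompletion(jobSet: JobSet, job: Job): Unit = job.result match {
  case Failure(e) =>
    // Keep the JobSet: recovery from a checkpoint must reprocess this batch.
    println(s"Error running job ${job.id}: $e")
  case Success(_) if jobSet.allSucceeded =>
    jobSets.remove(jobSet.time)   // only now is the batch treated as completed
  case Success(_) =>
    // Some other job in this batch failed or is still pending; keep the JobSet.
}

// A batch with one failed job is never removed, so it survives into the checkpoint.
val batch = JobSet(1000L, Seq(
  Job("topic-a", Success(())),
  Job("topic-b", Failure(new RuntimeException("user error")))))
jobSets.put(batch.time, batch)
batch.jobs.foreach(handleJobCompletion(batch, _))
assert(jobSets.contains(1000L))
```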

## How was this patch tested?

existing tests

Author: CodingCat 
Author: Nan Zhu 

Closes #16542 from CodingCat/SPARK-18905.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f8db8945
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f8db8945
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f8db8945

Branch: refs/heads/master
Commit: f8db8945f25cb884278ff8841bac5f6f28f0dec6
Parents: c84f7d3
Author: CodingCat 
Authored: Mon Jan 16 18:33:20 2017 -0800
Committer: Shixiong Zhu 
Committed: Mon Jan 16 18:33:20 2017 -0800

--
 .../spark/streaming/scheduler/JobScheduler.scala  | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f8db8945/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala
--
diff --git 
a/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala
 
b/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala
index b7d114b..2fa3bf7 100644
--- 
a/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala
+++ 
b/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala
@@ -201,18 +201,20 @@ class JobScheduler(val ssc: StreamingContext) extends 
Logging {
 
listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo))
 logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
 if (jobSet.hasCompleted) {
-  jobSets.remove(jobSet.time)
-  jobGenerator.onBatchCompletion(jobSet.time)
-  logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
-jobSet.totalDelay / 1000.0, jobSet.time.toString,
-jobSet.processingDelay / 1000.0
-  ))
   listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
 }
 job.result match {
   case Failure(e) =>
 reportError("Error running job " + job, e)
   case _ =>
+if (jobSet.hasCompleted) {
+  jobSets.remove(jobSet.time)
+  jobGenerator.onBatchCompletion(jobSet.time)
+  logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
+jobSet.totalDelay / 1000.0, jobSet.time.toString,
+jobSet.processingDelay / 1000.0
+  ))
+}
 }
   }
 





spark git commit: [SPARK-18905][STREAMING] Fix the issue of removing a failed jobset from JobScheduler.jobSets

2017-01-16 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 975890507 -> f4317be66


[SPARK-18905][STREAMING] Fix the issue of removing a failed jobset from 
JobScheduler.jobSets

## What changes were proposed in this pull request?

The current implementation of Spark Streaming considers a batch completed
regardless of the results of its jobs
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203).
Consider the following case: a micro batch contains two jobs that read from two
different Kafka topics. One of these jobs fails due to a problem in the
user-defined logic after the other one has finished successfully.
1. The main thread in the Spark Streaming application executes the line
mentioned above,
2. another thread (the checkpoint writer) writes a checkpoint file immediately
after that line is executed,
3. then, due to the current error-handling mechanism in Spark Streaming, the
StreamingContext is closed
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214).
When the user recovers from the checkpoint file, the JobSet containing the
failed job was already removed (taken as completed) before the checkpoint was
constructed, so the data being processed by the failed job is never
reprocessed.

This PR fixes it by removing a JobSet from JobScheduler.jobSets only when all
jobs in the JobSet have finished successfully.

## How was this patch tested?

existing tests

Author: CodingCat 
Author: Nan Zhu 

Closes #16542 from CodingCat/SPARK-18905.

(cherry picked from commit f8db8945f25cb884278ff8841bac5f6f28f0dec6)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f4317be6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f4317be6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f4317be6

Branch: refs/heads/branch-2.1
Commit: f4317be66d0e169693e3407abf3d0bfa4d7e37af
Parents: 9758905
Author: CodingCat 
Authored: Mon Jan 16 18:33:20 2017 -0800
Committer: Shixiong Zhu 
Committed: Mon Jan 16 18:33:29 2017 -0800

--
 .../spark/streaming/scheduler/JobScheduler.scala  | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f4317be6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala
--
diff --git 
a/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala
 
b/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala
index 98e0993..74ec19f 100644
--- 
a/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala
+++ 
b/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala
@@ -200,18 +200,20 @@ class JobScheduler(val ssc: StreamingContext) extends 
Logging {
 
listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo))
 logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
 if (jobSet.hasCompleted) {
-  jobSets.remove(jobSet.time)
-  jobGenerator.onBatchCompletion(jobSet.time)
-  logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
-jobSet.totalDelay / 1000.0, jobSet.time.toString,
-jobSet.processingDelay / 1000.0
-  ))
   listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
 }
 job.result match {
   case Failure(e) =>
 reportError("Error running job " + job, e)
   case _ =>
+if (jobSet.hasCompleted) {
+  jobSets.remove(jobSet.time)
+  jobGenerator.onBatchCompletion(jobSet.time)
+  logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
+jobSet.totalDelay / 1000.0, jobSet.time.toString,
+jobSet.processingDelay / 1000.0
+  ))
+}
 }
   }
 





spark git commit: [SPARK-19113][SS][TESTS] Ignore StreamingQueryException thrown from awaitInitialization to avoid breaking tests

2017-01-18 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 33791a8ce -> c050c1227


[SPARK-19113][SS][TESTS] Ignore StreamingQueryException thrown from 
awaitInitialization to avoid breaking tests

## What changes were proposed in this pull request?

#16492 missed one race condition: `StreamExecution.awaitInitialization` may 
throw fatal errors and fail the test. This PR just ignores 
`StreamingQueryException` thrown from `awaitInitialization` so that we can 
verify the exception in the `ExpectFailure` action later. It's fine since 
`StopStream` or `ExpectFailure` will catch `StreamingQueryException` as well.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #16567 from zsxwing/SPARK-19113-2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c050c122
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c050c122
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c050c122

Branch: refs/heads/master
Commit: c050c12274fba2ac4c4938c4724049a47fa59280
Parents: 33791a8
Author: Shixiong Zhu 
Authored: Wed Jan 18 10:50:51 2017 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 18 10:50:51 2017 -0800

--
 .../scala/org/apache/spark/sql/streaming/StreamTest.scala | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c050c122/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
index 4aa4100..af2f31a 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
@@ -385,7 +385,12 @@ trait StreamTest extends QueryTest with SharedSQLContext 
with Timeouts {
 .streamingQuery
 // Wait until the initialization finishes, because some tests need 
to use `logicalPlan`
 // after starting the query.
-currentStream.awaitInitialization(streamingTimeout.toMillis)
+try {
+  currentStream.awaitInitialization(streamingTimeout.toMillis)
+} catch {
+  case _: StreamingQueryException =>
+// Ignore the exception. `StopStream` or `ExpectFailure` will 
catch it as well.
+}
 
   case AdvanceManualClock(timeToAdd) =>
 verify(currentStream != null,





spark git commit: [SPARK-19113][SS][TESTS] Ignore StreamingQueryException thrown from awaitInitialization to avoid breaking tests

2017-01-18 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 77202a6c5 -> 047506bae


[SPARK-19113][SS][TESTS] Ignore StreamingQueryException thrown from 
awaitInitialization to avoid breaking tests

## What changes were proposed in this pull request?

#16492 missed one race condition: `StreamExecution.awaitInitialization` may 
throw fatal errors and fail the test. This PR just ignores 
`StreamingQueryException` thrown from `awaitInitialization` so that we can 
verify the exception in the `ExpectFailure` action later. It's fine since 
`StopStream` or `ExpectFailure` will catch `StreamingQueryException` as well.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #16567 from zsxwing/SPARK-19113-2.

(cherry picked from commit c050c12274fba2ac4c4938c4724049a47fa59280)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/047506ba
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/047506ba
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/047506ba

Branch: refs/heads/branch-2.1
Commit: 047506bae4f9a3505ac886ba04969d8d11f5
Parents: 77202a6
Author: Shixiong Zhu 
Authored: Wed Jan 18 10:50:51 2017 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 18 10:51:00 2017 -0800

--
 .../scala/org/apache/spark/sql/streaming/StreamTest.scala | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/047506ba/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
index 4aa4100..af2f31a 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
@@ -385,7 +385,12 @@ trait StreamTest extends QueryTest with SharedSQLContext 
with Timeouts {
 .streamingQuery
 // Wait until the initialization finishes, because some tests need 
to use `logicalPlan`
 // after starting the query.
-currentStream.awaitInitialization(streamingTimeout.toMillis)
+try {
+  currentStream.awaitInitialization(streamingTimeout.toMillis)
+} catch {
+  case _: StreamingQueryException =>
+// Ignore the exception. `StopStream` or `ExpectFailure` will 
catch it as well.
+}
 
   case AdvanceManualClock(timeToAdd) =>
 verify(currentStream != null,





spark git commit: [SPARK-19168][STRUCTURED STREAMING] StateStore should be aborted upon error

2017-01-18 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 047506bae -> 4cff0b504


[SPARK-19168][STRUCTURED STREAMING] StateStore should be aborted upon error

## What changes were proposed in this pull request?

We should call `StateStore.abort()` when any error occurs before the store is
committed.
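
A simplified sketch of the commit-or-abort pattern with stand-in types (not Spark's `StateStore` or `TaskContext` APIs): cleanup is registered up front, and the store is aborted only when the write path never reached `commit()`, which mirrors the task-completion listener added in this patch.

```scala
import scala.collection.mutable

trait SimpleStore {
  def put(key: String, value: String): Unit
  def commit(): Unit
  def abort(): Unit
  def hasCommitted: Boolean
}

final class InMemoryStore extends SimpleStore {
  private val data = mutable.Map.empty[String, String]
  private var committed = false
  def put(key: String, value: String): Unit = data(key) = value
  def commit(): Unit = committed = true
  def abort(): Unit = { data.clear(); committed = false }
  def hasCommitted: Boolean = committed
}

def writeAll(store: SimpleStore, rows: Iterator[(String, String)]): Unit =
  try {
    rows.foreach { case (k, v) => store.put(k, v) }   // user code may throw here
    store.commit()
  } finally {
    if (!store.hasCommitted) store.abort()            // never leave the store half-written
  }

val store = new InMemoryStore
val rows = Iterator("a" -> "1", "b" -> "2").map {
  case ("b", _) => throw new RuntimeException("error before commit")
  case kv       => kv
}
try writeAll(store, rows) catch { case _: RuntimeException => () }
assert(!store.hasCommitted)   // the failed write was aborted rather than committed
```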

## How was this patch tested?

Manually.

Author: Liwei Lin 

Closes #16547 from lw-lin/append-filter.

(cherry picked from commit 569e50680f97b1ed054337a39fe198769ef52d93)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4cff0b50
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4cff0b50
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4cff0b50

Branch: refs/heads/branch-2.1
Commit: 4cff0b504c367db314f10e730fe39dc083529f16
Parents: 047506b
Author: Liwei Lin 
Authored: Wed Jan 18 10:52:47 2017 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 18 10:52:54 2017 -0800

--
 .../spark/sql/execution/streaming/StatefulAggregate.scala| 8 
 .../streaming/state/HDFSBackedStateStoreProvider.scala   | 2 +-
 .../spark/sql/execution/streaming/state/StateStore.scala | 2 +-
 3 files changed, 10 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4cff0b50/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulAggregate.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulAggregate.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulAggregate.scala
index 0551e4b..d4ccced 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulAggregate.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulAggregate.scala
@@ -31,6 +31,7 @@ import org.apache.spark.sql.execution.streaming.state._
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.streaming.OutputMode
 import org.apache.spark.sql.types.StructType
+import org.apache.spark.TaskContext
 
 
 /** Used to identify the state store for a given operator. */
@@ -150,6 +151,13 @@ case class StateStoreSaveExec(
 val numTotalStateRows = longMetric("numTotalStateRows")
 val numUpdatedStateRows = longMetric("numUpdatedStateRows")
 
+// Abort the state store in case of error
+TaskContext.get().addTaskCompletionListener(_ => {
+  if (!store.hasCommitted) {
+store.abort()
+  }
+})
+
 outputMode match {
   // Update and output all rows in the StateStore.
   case Some(Complete) =>

http://git-wip-us.apache.org/repos/asf/spark/blob/4cff0b50/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
index 4f3f818..1279b71 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
@@ -203,7 +203,7 @@ private[state] class HDFSBackedStateStoreProvider(
 /**
  * Whether all updates have been committed
  */
-override private[state] def hasCommitted: Boolean = {
+override private[streaming] def hasCommitted: Boolean = {
   state == COMMITTED
 }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/4cff0b50/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
index 9bc6c0e..d59746f 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
@@ -83,7 +83,7 @@ trait StateStore {
   /**
* Whether all updates have been committed
*/
-  private[state] def hasCommitted: Boolean
+  private[streaming] def hasCommitted: Boolean
 }
 
 





spark git commit: [SPARK-19168][STRUCTURED STREAMING] StateStore should be aborted upon error

2017-01-18 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master c050c1227 -> 569e50680


[SPARK-19168][STRUCTURED STREAMING] StateStore should be aborted upon error

## What changes were proposed in this pull request?

We should call `StateStore.abort()` when any error occurs before the store is
committed.

## How was this patch tested?

Manually.

Author: Liwei Lin 

Closes #16547 from lw-lin/append-filter.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/569e5068
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/569e5068
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/569e5068

Branch: refs/heads/master
Commit: 569e50680f97b1ed054337a39fe198769ef52d93
Parents: c050c12
Author: Liwei Lin 
Authored: Wed Jan 18 10:52:47 2017 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 18 10:52:47 2017 -0800

--
 .../spark/sql/execution/streaming/StatefulAggregate.scala| 8 
 .../streaming/state/HDFSBackedStateStoreProvider.scala   | 2 +-
 .../spark/sql/execution/streaming/state/StateStore.scala | 2 +-
 3 files changed, 10 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/569e5068/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulAggregate.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulAggregate.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulAggregate.scala
index 0551e4b..d4ccced 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulAggregate.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulAggregate.scala
@@ -31,6 +31,7 @@ import org.apache.spark.sql.execution.streaming.state._
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.streaming.OutputMode
 import org.apache.spark.sql.types.StructType
+import org.apache.spark.TaskContext
 
 
 /** Used to identify the state store for a given operator. */
@@ -150,6 +151,13 @@ case class StateStoreSaveExec(
 val numTotalStateRows = longMetric("numTotalStateRows")
 val numUpdatedStateRows = longMetric("numUpdatedStateRows")
 
+// Abort the state store in case of error
+TaskContext.get().addTaskCompletionListener(_ => {
+  if (!store.hasCommitted) {
+store.abort()
+  }
+})
+
 outputMode match {
   // Update and output all rows in the StateStore.
   case Some(Complete) =>

http://git-wip-us.apache.org/repos/asf/spark/blob/569e5068/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
index 4f3f818..1279b71 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
@@ -203,7 +203,7 @@ private[state] class HDFSBackedStateStoreProvider(
 /**
  * Whether all updates have been committed
  */
-override private[state] def hasCommitted: Boolean = {
+override private[streaming] def hasCommitted: Boolean = {
   state == COMMITTED
 }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/569e5068/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
index 9bc6c0e..d59746f 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
@@ -83,7 +83,7 @@ trait StateStore {
   /**
* Whether all updates have been committed
*/
-  private[state] def hasCommitted: Boolean
+  private[streaming] def hasCommitted: Boolean
 }
 
 





spark git commit: [SPARK-19182][DSTREAM] Optimize the lock in StreamingJobProgressListener to not block UI when generating Streaming jobs

2017-01-18 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 569e50680 -> a81e336f1


[SPARK-19182][DSTREAM] Optimize the lock in StreamingJobProgressListener to not 
block UI when generating Streaming jobs

## What changes were proposed in this pull request?

When DStreamGraph is generating a job, it holds a lock and blocks other APIs.
Because StreamingJobProgressListener (numInactiveReceivers,
streamName(streamId: Int), streamIds) needs to call DStreamGraph's methods to
access some information, the UI may hang if job generation is very slow
(e.g., talking to a slow Kafka cluster to fetch metadata).
It's better to optimize the locks in DStreamGraph and
StreamingJobProgressListener so that the UI is not blocked by job generation.
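
A simplified sketch of the snapshot pattern used here (an illustrative class, not the DStreamGraph API): the lock-protected start path publishes `@volatile` copies of the values the UI needs, so the UI-facing getters never have to take the lock while a slow job generation holds it.

```scala
final class GraphSnapshots {
  @volatile private var inputStreamNameAndID: Seq[(String, Int)] = Nil
  @volatile private var numReceivers: Int = 0

  def start(streams: Seq[(String, Int)], receivers: Int): Unit = synchronized {
    // ... expensive, lock-protected initialization and job generation state lives here ...
    inputStreamNameAndID = streams
    numReceivers = receivers
  }

  // Read paths used by the UI: no synchronization, so a slow generateJobs() cannot block them.
  def getInputStreamNameAndID: Seq[(String, Int)] = inputStreamNameAndID
  def getNumReceivers: Int = numReceivers
}
```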

## How was this patch tested?
Existing unit tests.

cc zsxwing

Author: uncleGen 

Closes #16601 from uncleGen/SPARK-19182.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a81e336f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a81e336f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a81e336f

Branch: refs/heads/master
Commit: a81e336f1eddc2c6245d807aae2c81ddc60eabf9
Parents: 569e506
Author: uncleGen 
Authored: Wed Jan 18 10:55:31 2017 -0800
Committer: Shixiong Zhu 
Committed: Wed Jan 18 10:55:31 2017 -0800

--
 .../org/apache/spark/streaming/DStreamGraph.scala  | 13 +
 .../streaming/ui/StreamingJobProgressListener.scala|  8 
 2 files changed, 13 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a81e336f/streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
--
diff --git 
a/streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala 
b/streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
index 54d736e..dce2028 100644
--- a/streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
@@ -31,12 +31,15 @@ final private[streaming] class DStreamGraph extends 
Serializable with Logging {
   private val inputStreams = new ArrayBuffer[InputDStream[_]]()
   private val outputStreams = new ArrayBuffer[DStream[_]]()
 
+  @volatile private var inputStreamNameAndID: Seq[(String, Int)] = Nil
+
   var rememberDuration: Duration = null
   var checkpointInProgress = false
 
   var zeroTime: Time = null
   var startTime: Time = null
   var batchDuration: Duration = null
+  @volatile private var numReceivers: Int = 0
 
   def start(time: Time) {
 this.synchronized {
@@ -45,7 +48,9 @@ final private[streaming] class DStreamGraph extends 
Serializable with Logging {
   startTime = time
   outputStreams.foreach(_.initialize(zeroTime))
   outputStreams.foreach(_.remember(rememberDuration))
-  outputStreams.foreach(_.validateAtStart)
+  outputStreams.foreach(_.validateAtStart())
+  numReceivers = 
inputStreams.count(_.isInstanceOf[ReceiverInputDStream[_]])
+  inputStreamNameAndID = inputStreams.map(is => (is.name, is.id))
   inputStreams.par.foreach(_.start())
 }
   }
@@ -106,9 +111,9 @@ final private[streaming] class DStreamGraph extends 
Serializable with Logging {
   .toArray
   }
 
-  def getInputStreamName(streamId: Int): Option[String] = synchronized {
-inputStreams.find(_.id == streamId).map(_.name)
-  }
+  def getNumReceivers: Int = numReceivers
+
+  def getInputStreamNameAndID: Seq[(String, Int)] = inputStreamNameAndID
 
   def generateJobs(time: Time): Seq[Job] = {
 logDebug("Generating jobs for time " + time)

http://git-wip-us.apache.org/repos/asf/spark/blob/a81e336f/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
--
diff --git 
a/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
 
b/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
index 95f5821..ed4c1e4 100644
--- 
a/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
+++ 
b/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala
@@ -169,7 +169,7 @@ private[spark] class StreamingJobProgressListener(ssc: 
StreamingContext)
   }
 
   def numInactiveReceivers: Int = {
-ssc.graph.getReceiverInputStreams().length - numActiveReceivers
+ssc.graph.getNumReceivers - numActiveReceivers
   }
 
   def numTotalCompletedBatches: Long = synchronized {
@@ -197,17 +197,17 @@ private[spark] class StreamingJobProgressListener(ssc: 
StreamingContext)
   }
 
   def retainedCompletedBatches: Seq[BatchUIData] = synchronized {
-

spark git commit: [SPARK-19268][SS] Disallow adaptive query execution for streaming queries

2017-01-23 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 4a2be0902 -> 570e5e11d


[SPARK-19268][SS] Disallow adaptive query execution for streaming queries

## What changes were proposed in this pull request?

As adaptive query execution may change the number of partitions in different 
batches, it may break streaming queries. Hence, we should disallow this feature 
in Structured Streaming.
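
A minimal usage sketch of the new behavior, assuming an existing SparkSession `spark` and an illustrative socket source: starting any streaming query with adaptive execution enabled is expected to fail fast with an `AnalysisException` that names `spark.sql.adaptive.enabled`.

```scala
import org.apache.spark.sql.AnalysisException

spark.conf.set("spark.sql.adaptive.enabled", "true")

val stream = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

try {
  stream.writeStream.format("console").start()
} catch {
  case e: AnalysisException =>
    // The query is rejected before it starts running.
    println(s"Rejected as expected: ${e.getMessage}")
}
```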

## How was this patch tested?

`test("SPARK-19268: Adaptive query execution should be disallowed")`.

Author: Shixiong Zhu 

Closes #16683 from zsxwing/SPARK-19268.

(cherry picked from commit 60bd91a34078a9239fbf5e8f49ce8b680c11635d)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/570e5e11
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/570e5e11
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/570e5e11

Branch: refs/heads/branch-2.1
Commit: 570e5e11dfd5d9fa3ee6caae3bba85d53ceac4e8
Parents: 4a2be09
Author: Shixiong Zhu 
Authored: Mon Jan 23 22:30:51 2017 -0800
Committer: Shixiong Zhu 
Committed: Mon Jan 23 22:31:01 2017 -0800

--
 .../spark/sql/streaming/StreamingQueryManager.scala |  6 ++
 .../sql/streaming/StreamingQueryManagerSuite.scala  | 12 +++-
 2 files changed, 17 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/570e5e11/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala
index 7b9770d..0b9406b 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala
@@ -230,6 +230,12 @@ class StreamingQueryManager private[sql] (sparkSession: 
SparkSession) {
   UnsupportedOperationChecker.checkForStreaming(analyzedPlan, outputMode)
 }
 
+if (sparkSession.sessionState.conf.adaptiveExecutionEnabled) {
+  throw new AnalysisException(
+s"${SQLConf.ADAPTIVE_EXECUTION_ENABLED.key} " +
+  "is not supported in streaming DataFrames/Datasets")
+}
+
 new StreamingQueryWrapper(new StreamExecution(
   sparkSession,
   userSpecifiedName.orNull,

http://git-wip-us.apache.org/repos/asf/spark/blob/570e5e11/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryManagerSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryManagerSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryManagerSuite.scala
index 8e16fd4..f05e9d1 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryManagerSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryManagerSuite.scala
@@ -30,8 +30,9 @@ import org.scalatest.time.Span
 import org.scalatest.time.SpanSugar._
 
 import org.apache.spark.SparkException
-import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.{AnalysisException, Dataset}
 import org.apache.spark.sql.execution.streaming._
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.streaming.util.BlockingSource
 import org.apache.spark.util.Utils
 
@@ -238,6 +239,15 @@ class StreamingQueryManagerSuite extends StreamTest with 
BeforeAndAfter {
 }
   }
 
+  test("SPARK-19268: Adaptive query execution should be disallowed") {
+withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
+  val e = intercept[AnalysisException] {
+
MemoryStream[Int].toDS.writeStream.queryName("test-query").format("memory").start()
+  }
+  assert(e.getMessage.contains(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key) &&
+e.getMessage.contains("not supported"))
+}
+  }
 
   /** Run a body of code by defining a query on each dataset */
   private def withQueriesOn(datasets: Dataset[_]*)(body: Seq[StreamingQuery] 
=> Unit): Unit = {





spark git commit: [SPARK-19268][SS] Disallow adaptive query execution for streaming queries

2017-01-23 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master e576c1ed7 -> 60bd91a34


[SPARK-19268][SS] Disallow adaptive query execution for streaming queries

## What changes were proposed in this pull request?

As adaptive query execution may change the number of partitions in different 
batches, it may break streaming queries. Hence, we should disallow this feature 
in Structured Streaming.

## How was this patch tested?

`test("SPARK-19268: Adaptive query execution should be disallowed")`.

Author: Shixiong Zhu 

Closes #16683 from zsxwing/SPARK-19268.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/60bd91a3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/60bd91a3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/60bd91a3

Branch: refs/heads/master
Commit: 60bd91a34078a9239fbf5e8f49ce8b680c11635d
Parents: e576c1e
Author: Shixiong Zhu 
Authored: Mon Jan 23 22:30:51 2017 -0800
Committer: Shixiong Zhu 
Committed: Mon Jan 23 22:30:51 2017 -0800

--
 .../spark/sql/streaming/StreamingQueryManager.scala |  6 ++
 .../sql/streaming/StreamingQueryManagerSuite.scala  | 12 +++-
 2 files changed, 17 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/60bd91a3/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala
index 7b9770d..0b9406b 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala
@@ -230,6 +230,12 @@ class StreamingQueryManager private[sql] (sparkSession: 
SparkSession) {
   UnsupportedOperationChecker.checkForStreaming(analyzedPlan, outputMode)
 }
 
+if (sparkSession.sessionState.conf.adaptiveExecutionEnabled) {
+  throw new AnalysisException(
+s"${SQLConf.ADAPTIVE_EXECUTION_ENABLED.key} " +
+  "is not supported in streaming DataFrames/Datasets")
+}
+
 new StreamingQueryWrapper(new StreamExecution(
   sparkSession,
   userSpecifiedName.orNull,

http://git-wip-us.apache.org/repos/asf/spark/blob/60bd91a3/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryManagerSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryManagerSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryManagerSuite.scala
index 8e16fd4..f05e9d1 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryManagerSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryManagerSuite.scala
@@ -30,8 +30,9 @@ import org.scalatest.time.Span
 import org.scalatest.time.SpanSugar._
 
 import org.apache.spark.SparkException
-import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.{AnalysisException, Dataset}
 import org.apache.spark.sql.execution.streaming._
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.streaming.util.BlockingSource
 import org.apache.spark.util.Utils
 
@@ -238,6 +239,15 @@ class StreamingQueryManagerSuite extends StreamTest with 
BeforeAndAfter {
 }
   }
 
+  test("SPARK-19268: Adaptive query execution should be disallowed") {
+withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
+  val e = intercept[AnalysisException] {
+
MemoryStream[Int].toDS.writeStream.queryName("test-query").format("memory").start()
+  }
+  assert(e.getMessage.contains(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key) &&
+e.getMessage.contains("not supported"))
+}
+  }
 
   /** Run a body of code by defining a query on each dataset */
   private def withQueriesOn(datasets: Dataset[_]*)(body: Seq[StreamingQuery] 
=> Unit): Unit = {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[1/2] spark git commit: [SPARK-19139][CORE] New auth mechanism for transport library.

2017-01-24 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master d9783380f -> 8f3f73abc


http://git-wip-us.apache.org/repos/asf/spark/blob/8f3f73ab/common/network-common/src/test/java/org/apache/spark/network/crypto/AuthIntegrationSuite.java
--
diff --git 
a/common/network-common/src/test/java/org/apache/spark/network/crypto/AuthIntegrationSuite.java
 
b/common/network-common/src/test/java/org/apache/spark/network/crypto/AuthIntegrationSuite.java
new file mode 100644
index 000..21609d5
--- /dev/null
+++ 
b/common/network-common/src/test/java/org/apache/spark/network/crypto/AuthIntegrationSuite.java
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.crypto;
+
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Lists;
+import io.netty.channel.Channel;
+import org.junit.After;
+import org.junit.Test;
+import static org.junit.Assert.*;
+import static org.mockito.Mockito.*;
+
+import org.apache.spark.network.TestUtils;
+import org.apache.spark.network.TransportContext;
+import org.apache.spark.network.client.RpcResponseCallback;
+import org.apache.spark.network.client.TransportClient;
+import org.apache.spark.network.client.TransportClientBootstrap;
+import org.apache.spark.network.sasl.SaslRpcHandler;
+import org.apache.spark.network.sasl.SaslServerBootstrap;
+import org.apache.spark.network.sasl.SecretKeyHolder;
+import org.apache.spark.network.server.RpcHandler;
+import org.apache.spark.network.server.StreamManager;
+import org.apache.spark.network.server.TransportServer;
+import org.apache.spark.network.server.TransportServerBootstrap;
+import org.apache.spark.network.util.JavaUtils;
+import org.apache.spark.network.util.MapConfigProvider;
+import org.apache.spark.network.util.TransportConf;
+
+public class AuthIntegrationSuite {
+
+  private AuthTestCtx ctx;
+
+  @After
+  public void cleanUp() throws Exception {
+if (ctx != null) {
+  ctx.close();
+}
+ctx = null;
+  }
+
+  @Test
+  public void testNewAuth() throws Exception {
+ctx = new AuthTestCtx();
+ctx.createServer("secret");
+ctx.createClient("secret");
+
+ByteBuffer reply = ctx.client.sendRpcSync(JavaUtils.stringToBytes("Ping"), 
5000);
+assertEquals("Pong", JavaUtils.bytesToString(reply));
+assertTrue(ctx.authRpcHandler.doDelegate);
+assertFalse(ctx.authRpcHandler.delegate instanceof SaslRpcHandler);
+  }
+
+  @Test
+  public void testAuthFailure() throws Exception {
+ctx = new AuthTestCtx();
+ctx.createServer("server");
+
+try {
+  ctx.createClient("client");
+  fail("Should have failed to create client.");
+} catch (Exception e) {
+  assertFalse(ctx.authRpcHandler.doDelegate);
+  assertFalse(ctx.serverChannel.isActive());
+}
+  }
+
+  @Test
+  public void testSaslServerFallback() throws Exception {
+ctx = new AuthTestCtx();
+ctx.createServer("secret", true);
+ctx.createClient("secret", false);
+
+ByteBuffer reply = ctx.client.sendRpcSync(JavaUtils.stringToBytes("Ping"), 
5000);
+assertEquals("Pong", JavaUtils.bytesToString(reply));
+  }
+
+  @Test
+  public void testSaslClientFallback() throws Exception {
+ctx = new AuthTestCtx();
+ctx.createServer("secret", false);
+ctx.createClient("secret", true);
+
+ByteBuffer reply = ctx.client.sendRpcSync(JavaUtils.stringToBytes("Ping"), 
5000);
+assertEquals("Pong", JavaUtils.bytesToString(reply));
+  }
+
+  @Test
+  public void testAuthReplay() throws Exception {
+// This test covers the case where an attacker replays a challenge message 
sniffed from the
+// network, but doesn't know the actual secret. The server should close 
the connection as
+// soon as a message is sent after authentication is performed. This is 
emulated by removing
+// the client encryption handler after authentication.
+ctx = new AuthTestCtx();
+ctx.createServer("secret");
+ctx.createClient("secret");
+
+assertNotNull(ctx.client.getChannel().pipeline()
+  .remove(TransportCipher.ENCRYPTION_HANDLER_NAME));

[2/2] spark git commit: [SPARK-19139][CORE] New auth mechanism for transport library.

2017-01-24 Thread zsxwing
[SPARK-19139][CORE] New auth mechanism for transport library.

This change introduces a new auth mechanism to the transport library,
to be used when users enable strong encryption. This auth mechanism
has better security than the currently used DIGEST-MD5.

The new protocol uses symmetric key encryption to mutually authenticate
the endpoints, and is very loosely based on ISO/IEC 9798.

The new protocol falls back to SASL when it detects that the remote end is old.
Because SASL does not support asking the server for multiple auth protocols
(which would have let us re-use the existing SASL code by just adding a new
SASL provider), the protocol is implemented outside of the SASL API
to avoid the boilerplate of adding a new provider.

Details of the auth protocol are discussed in the included README.md
file.

This change partly undoes the changes added in SPARK-13331; AES encryption
is now decoupled from SASL authentication. The encryption code itself,
though, has been re-used as part of this change.
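
For reference, based on the `spark.network.crypto.*` entries this change adds to `docs/configuration.md`, enabling the new mechanism might look like the sketch below (the exact keys are taken from this diff and should be treated as assumptions here; `spark.authenticate` still has to be on, as before):

```scala
import org.apache.spark.SparkConf

// Minimal sketch; key names assumed from the docs/config changes in this diff.
val conf = new SparkConf()
  .set("spark.authenticate", "true")                 // shared-secret auth, as before
  .set("spark.network.crypto.enabled", "true")       // use the new auth/encryption protocol
  .set("spark.network.crypto.saslFallback", "true")  // still accept peers that only speak SASL
```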

## How was this patch tested?

- Unit tests
- Tested Spark 2.2 against Spark 1.6 shuffle service with SASL enabled
- Tested Spark 2.2 against Spark 2.2 shuffle service with SASL fallback disabled

Author: Marcelo Vanzin 

Closes #16521 from vanzin/SPARK-19139.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8f3f73ab
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8f3f73ab
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8f3f73ab

Branch: refs/heads/master
Commit: 8f3f73abc1fe62496722476460c174af0250e3fe
Parents: d978338
Author: Marcelo Vanzin 
Authored: Tue Jan 24 10:44:04 2017 -0800
Committer: Shixiong Zhu 
Committed: Tue Jan 24 10:44:04 2017 -0800

--
 .../network/crypto/AuthClientBootstrap.java | 128 +
 .../apache/spark/network/crypto/AuthEngine.java | 284 +++
 .../spark/network/crypto/AuthRpcHandler.java| 170 +++
 .../network/crypto/AuthServerBootstrap.java |  55 
 .../spark/network/crypto/ClientChallenge.java   | 101 +++
 .../org/apache/spark/network/crypto/README.md   | 158 +++
 .../spark/network/crypto/ServerResponse.java|  85 ++
 .../spark/network/crypto/TransportCipher.java   | 257 +
 .../spark/network/sasl/SaslClientBootstrap.java |  36 +--
 .../spark/network/sasl/SaslRpcHandler.java  |  41 +--
 .../spark/network/sasl/aes/AesCipher.java   | 281 --
 .../network/sasl/aes/AesConfigMessage.java  | 101 ---
 .../spark/network/util/TransportConf.java   |  92 --
 .../spark/network/crypto/AuthEngineSuite.java   | 109 +++
 .../network/crypto/AuthIntegrationSuite.java| 213 ++
 .../spark/network/crypto/AuthMessagesSuite.java |  80 ++
 .../spark/network/sasl/SparkSaslSuite.java  |  97 +--
 .../network/shuffle/ExternalShuffleClient.java  |  19 +-
 .../mesos/MesosExternalShuffleClient.java   |   5 +-
 .../ExternalShuffleIntegrationSuite.java|   4 +-
 .../shuffle/ExternalShuffleSecuritySuite.java   |   9 +-
 .../spark/network/yarn/YarnShuffleService.java  |   4 +-
 .../org/apache/spark/SecurityManager.scala  |  11 +-
 .../main/scala/org/apache/spark/SparkConf.scala |   5 +
 .../main/scala/org/apache/spark/SparkEnv.scala  |   2 +-
 .../spark/deploy/ExternalShuffleService.scala   |  10 +-
 .../apache/spark/internal/config/package.scala  |  16 ++
 .../netty/NettyBlockTransferService.scala   |   7 +-
 .../apache/spark/rpc/netty/NettyRpcEnv.scala|   8 +-
 .../org/apache/spark/storage/BlockManager.scala |   3 +-
 .../scala/org/apache/spark/SparkConfSuite.scala |  19 ++
 .../netty/NettyBlockTransferSecuritySuite.scala |  14 +
 .../org/apache/spark/rpc/RpcEnvSuite.scala  |  54 +++-
 docs/configuration.md   |  50 ++--
 .../MesosCoarseGrainedSchedulerBackend.scala|   3 +-
 35 files changed, 1909 insertions(+), 622 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8f3f73ab/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthClientBootstrap.java
--
diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthClientBootstrap.java
 
b/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthClientBootstrap.java
new file mode 100644
index 000..980525d
--- /dev/null
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthClientBootstrap.java
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Vers

spark git commit: [SPARK-19330][DSTREAMS] Also show tooltip for successful batches

2017-01-24 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 15ef3740d -> 40a4cfc7c


[SPARK-19330][DSTREAMS] Also show tooltip for successful batches

## What changes were proposed in this pull request?

### Before
![_streaming_before](https://cloud.githubusercontent.com/assets/15843379/22181462/1e45c20c-e0c8-11e6-831c-8bf69722a4ee.png)

### After
![_streaming_after](https://cloud.githubusercontent.com/assets/15843379/22181464/23f38a40-e0c8-11e6-9a87-e27b1ffb1935.png)

## How was this patch tested?

Manually

Author: Liwei Lin 

Closes #16673 from lw-lin/streaming.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/40a4cfc7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/40a4cfc7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/40a4cfc7

Branch: refs/heads/master
Commit: 40a4cfc7c7911107d1cf7a2663469031dcf1f576
Parents: 15ef374
Author: Liwei Lin 
Authored: Tue Jan 24 16:36:17 2017 -0800
Committer: Shixiong Zhu 
Committed: Tue Jan 24 16:36:17 2017 -0800

--
 .../org/apache/spark/streaming/ui/static/streaming-page.js   | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/40a4cfc7/streaming/src/main/resources/org/apache/spark/streaming/ui/static/streaming-page.js
--
diff --git 
a/streaming/src/main/resources/org/apache/spark/streaming/ui/static/streaming-page.js
 
b/streaming/src/main/resources/org/apache/spark/streaming/ui/static/streaming-page.js
index f82323a..d004f34 100644
--- 
a/streaming/src/main/resources/org/apache/spark/streaming/ui/static/streaming-page.js
+++ 
b/streaming/src/main/resources/org/apache/spark/streaming/ui/static/streaming-page.js
@@ -169,7 +169,7 @@ function drawTimeline(id, data, minX, maxX, minY, maxY, 
unitY, batchInterval) {
 .style("cursor", "pointer")
 .attr("cx", function(d) { return x(d.x); })
 .attr("cy", function(d) { return y(d.y); })
-.attr("r", function(d) { return isFailedBatch(d.x) ? "2" : "0";})
+.attr("r", function(d) { return isFailedBatch(d.x) ? "2" : "3";})
 .on('mouseover', function(d) {
 var tip = formatYValue(d.y) + " " + unitY + " at " + 
timeFormat[d.x];
 showBootstrapTooltip(d3.select(this).node(), tip);
@@ -187,7 +187,7 @@ function drawTimeline(id, data, minX, maxX, minY, maxY, 
unitY, batchInterval) {
 .attr("stroke", function(d) { return isFailedBatch(d.x) ? 
"red" : "white";})
 .attr("fill", function(d) { return isFailedBatch(d.x) ? 
"red" : "white";})
 .attr("opacity", function(d) { return isFailedBatch(d.x) ? 
"1" : "0";})
-.attr("r", function(d) { return isFailedBatch(d.x) ? "2" : 
"0";});
+.attr("r", function(d) { return isFailedBatch(d.x) ? "2" : 
"3";});
 })
 .on("click", function(d) {
 if (lastTimeout != null) {





spark git commit: [SPARK-19330][DSTREAMS] Also show tooltip for successful batches

2017-01-24 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 b94fb284b -> c13378796


[SPARK-19330][DSTREAMS] Also show tooltip for successful batches

## What changes were proposed in this pull request?

### Before
![_streaming_before](https://cloud.githubusercontent.com/assets/15843379/22181462/1e45c20c-e0c8-11e6-831c-8bf69722a4ee.png)

### After
![_streaming_after](https://cloud.githubusercontent.com/assets/15843379/22181464/23f38a40-e0c8-11e6-9a87-e27b1ffb1935.png)

## How was this patch tested?

Manually

Author: Liwei Lin 

Closes #16673 from lw-lin/streaming.

(cherry picked from commit 40a4cfc7c7911107d1cf7a2663469031dcf1f576)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c1337879
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c1337879
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c1337879

Branch: refs/heads/branch-2.1
Commit: c133787965e65e19c0aab636c941b5673e6a68e5
Parents: b94fb28
Author: Liwei Lin 
Authored: Tue Jan 24 16:36:17 2017 -0800
Committer: Shixiong Zhu 
Committed: Tue Jan 24 16:36:24 2017 -0800

--
 .../org/apache/spark/streaming/ui/static/streaming-page.js   | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c1337879/streaming/src/main/resources/org/apache/spark/streaming/ui/static/streaming-page.js
--
diff --git 
a/streaming/src/main/resources/org/apache/spark/streaming/ui/static/streaming-page.js
 
b/streaming/src/main/resources/org/apache/spark/streaming/ui/static/streaming-page.js
index f82323a..d004f34 100644
--- 
a/streaming/src/main/resources/org/apache/spark/streaming/ui/static/streaming-page.js
+++ 
b/streaming/src/main/resources/org/apache/spark/streaming/ui/static/streaming-page.js
@@ -169,7 +169,7 @@ function drawTimeline(id, data, minX, maxX, minY, maxY, 
unitY, batchInterval) {
 .style("cursor", "pointer")
 .attr("cx", function(d) { return x(d.x); })
 .attr("cy", function(d) { return y(d.y); })
-.attr("r", function(d) { return isFailedBatch(d.x) ? "2" : "0";})
+.attr("r", function(d) { return isFailedBatch(d.x) ? "2" : "3";})
 .on('mouseover', function(d) {
 var tip = formatYValue(d.y) + " " + unitY + " at " + 
timeFormat[d.x];
 showBootstrapTooltip(d3.select(this).node(), tip);
@@ -187,7 +187,7 @@ function drawTimeline(id, data, minX, maxX, minY, maxY, 
unitY, batchInterval) {
 .attr("stroke", function(d) { return isFailedBatch(d.x) ? 
"red" : "white";})
 .attr("fill", function(d) { return isFailedBatch(d.x) ? 
"red" : "white";})
 .attr("opacity", function(d) { return isFailedBatch(d.x) ? 
"1" : "0";})
-.attr("r", function(d) { return isFailedBatch(d.x) ? "2" : 
"0";});
+.attr("r", function(d) { return isFailedBatch(d.x) ? "2" : 
"3";});
 })
 .on("click", function(d) {
 if (lastTimeout != null) {





spark git commit: [SPARK-19365][CORE] Optimize RequestMessage serialization

2017-01-27 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master a7ab6f9a8 -> 21aa8c32b


[SPARK-19365][CORE] Optimize RequestMessage serialization

## What changes were proposed in this pull request?

Right now Netty RPC serializes `RequestMessage` using Java serialization, and
the size of a single message (e.g., `RequestMessage(..., "hello")`) is almost
1KB.

This PR optimizes it by serializing `RequestMessage` manually (eliminating
unnecessary information from most messages, e.g., the class names of
`RequestMessage`, `NettyRpcEndpointRef`, ...), and reduces the above message
size to 100+ bytes.
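
As an illustration of the general idea (this is not the actual `NettyRpcEnv` code, and the message shape below is hypothetical), manual serialization writes only the raw field values in a fixed order, while Java serialization additionally writes class descriptors for every object in the graph:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical message used only for this sketch.
case class SimpleRequest(senderHost: String, senderPort: Int, endpointName: String)

// Write just the field values; no class names or serialization headers.
def serialize(m: SimpleRequest): Array[Byte] = {
  val buf = new ByteArrayOutputStream()
  val out = new DataOutputStream(buf)
  out.writeUTF(m.senderHost)
  out.writeInt(m.senderPort)
  out.writeUTF(m.endpointName)
  out.flush()
  buf.toByteArray
}

// Read the fields back in the same order they were written.
def deserialize(bytes: Array[Byte]): SimpleRequest = {
  val in = new DataInputStream(new ByteArrayInputStream(bytes))
  SimpleRequest(in.readUTF(), in.readInt(), in.readUTF())
}
```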

## How was this patch tested?

Jenkins

I did a simple test to measure the improvement:

Before
```
$ bin/spark-shell --master local-cluster[1,4,1024]
...
scala> for (i <- 1 to 10) {
 |   val start = System.nanoTime
 |   val s = sc.parallelize(1 to 100, 10 * 1000).count()
 |   val end = System.nanoTime
 |   println(s"$i\t" + ((end - start)/1000/1000))
 | }
1   6830
2   4353
3   3322
4   3107
5   3235
6   3139
7   3156
8   3166
9   3091
10  3029
```
After:
```
$ bin/spark-shell --master local-cluster[1,4,1024]
...
scala> for (i <- 1 to 10) {
 |   val start = System.nanoTime
 |   val s = sc.parallelize(1 to 100, 10 * 1000).count()
 |   val end = System.nanoTime
 |   println(s"$i\t" + ((end - start)/1000/1000))
 | }
1   6431
2   3643
3   2913
4   2679
5   2760
6   2710
7   2747
8   2793
9   2679
10  2651
```

I also captured the TCP packets for this test. Before this patch, the total
size of the TCP packets was ~1.5GB; after it, the total drops to ~1.2GB.

Author: Shixiong Zhu 

Closes #16706 from zsxwing/rpc-opt.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/21aa8c32
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/21aa8c32
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/21aa8c32

Branch: refs/heads/master
Commit: 21aa8c32ba7a29aafc000ecce2e6c802ced6a009
Parents: a7ab6f9
Author: Shixiong Zhu 
Authored: Fri Jan 27 15:07:57 2017 -0800
Committer: Shixiong Zhu 
Committed: Fri Jan 27 15:07:57 2017 -0800

--
 .../apache/spark/rpc/RpcEndpointAddress.scala   |   5 +-
 .../apache/spark/rpc/netty/NettyRpcEnv.scala| 119 +++
 .../spark/rpc/netty/NettyRpcEnvSuite.scala  |  33 -
 .../spark/rpc/netty/NettyRpcHandlerSuite.scala  |   2 +-
 4 files changed, 132 insertions(+), 27 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/21aa8c32/core/src/main/scala/org/apache/spark/rpc/RpcEndpointAddress.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rpc/RpcEndpointAddress.scala 
b/core/src/main/scala/org/apache/spark/rpc/RpcEndpointAddress.scala
index b9db60a..fdbccc9 100644
--- a/core/src/main/scala/org/apache/spark/rpc/RpcEndpointAddress.scala
+++ b/core/src/main/scala/org/apache/spark/rpc/RpcEndpointAddress.scala
@@ -25,10 +25,11 @@ import org.apache.spark.SparkException
  * The `rpcAddress` may be null, in which case the endpoint is registered via 
a client-only
  * connection and can only be reached via the client that sent the endpoint 
reference.
  *
- * @param rpcAddress The socket address of the endpoint.
+ * @param rpcAddress The socket address of the endpoint. It's `null` when this address pointing to
+ *   an endpoint in a client `NettyRpcEnv`.
  * @param name Name of the endpoint.
  */
-private[spark] case class RpcEndpointAddress(val rpcAddress: RpcAddress, val name: String) {
+private[spark] case class RpcEndpointAddress(rpcAddress: RpcAddress, name: String) {
 
   require(name != null, "RpcEndpoint name must be provided.")
 

http://git-wip-us.apache.org/repos/asf/spark/blob/21aa8c32/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala 
b/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala
index 1e448b2..ff5e39a 100644
--- a/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala
+++ b/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala
@@ -37,8 +37,8 @@ import org.apache.spark.network.crypto.{AuthClientBootstrap, AuthServerBootstrap
 import org.apache.spark.network.netty.SparkTransportConf
 import org.apache.spark.network.server._
 import org.apache.spark.rpc._
-import org.apache.spark.serializer.{JavaSerializer, JavaSerializerInstance}
-import org.apache.spark.util.{ThreadUtils, Utils}
+import org.apache.spark.serializer.{JavaSerializer, JavaSerializerInstance, SerializationStream}
+import org.apac

spark git commit: [SPARK-19377][WEBUI][CORE] Killed tasks should have the status as KILLED

2017-02-01 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 61cdc8c7c -> f94646415


[SPARK-19377][WEBUI][CORE] Killed tasks should have the status as KILLED

## What changes were proposed in this pull request?

Copying of the killed status was missing when building the newTaskInfo object,
which drops unnecessary details to reduce memory usage. This patch copies the
killed status into the newTaskInfo object, so the Web UI displays such tasks
with the KILLED status instead of a wrong status.

## How was this patch tested?

Current behaviour of displaying tasks in stage UI page,

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143 | 10 | 0 | SUCCESS | NODE_LOCAL | 6 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
| 156 | 11 | 0 | SUCCESS | NODE_LOCAL | 5 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |

Web UI display after applying the patch,

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143 | 10 | 0 | KILLED | NODE_LOCAL | 6 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
| 156 | 11 | 0 | KILLED | NODE_LOCAL | 5 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |

Author: Devaraj K 

Closes #16725 from devaraj-kavali/SPARK-19377.

(cherry picked from commit df4a27cc5cae8e251ba2a883bcc5f5ce9282f649)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f9464641
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f9464641
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f9464641

Branch: refs/heads/branch-2.1
Commit: f946464155bb907482dc8d8a1b0964a925d04081
Parents: 61cdc8c
Author: Devaraj K 
Authored: Wed Feb 1 12:55:11 2017 -0800
Committer: Shixiong Zhu 
Committed: Wed Feb 1 12:55:19 2017 -0800

--
 core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f9464641/core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala
--
diff --git a/core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala 
b/core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala
index f4a0460..78113ac 100644
--- a/core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala
+++ b/core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala
@@ -176,6 +176,7 @@ private[spark] object UIData {
   }
   newTaskInfo.finishTime = taskInfo.finishTime
   newTaskInfo.failed = taskInfo.failed
+  newTaskInfo.killed = taskInfo.killed
   newTaskInfo
 }
   }





spark git commit: [SPARK-19377][WEBUI][CORE] Killed tasks should have the status as KILLED

2017-02-01 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 5ed397baa -> df4a27cc5


[SPARK-19377][WEBUI][CORE] Killed tasks should have the status as KILLED

## What changes were proposed in this pull request?

Copying of the killed status was missing when building the newTaskInfo object,
which drops unnecessary details to reduce memory usage. This patch copies the
killed status into the newTaskInfo object, so the Web UI displays such tasks
with the KILLED status instead of a wrong status.

## How was this patch tested?

Current behaviour of displaying tasks in stage UI page,

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143 | 10 | 0 | SUCCESS | NODE_LOCAL | 6 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
| 156 | 11 | 0 | SUCCESS | NODE_LOCAL | 5 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |

Web UI display after applying the patch,

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143 | 10 | 0 | KILLED | NODE_LOCAL | 6 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
| 156 | 11 | 0 | KILLED | NODE_LOCAL | 5 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |

Author: Devaraj K 

Closes #16725 from devaraj-kavali/SPARK-19377.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/df4a27cc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/df4a27cc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/df4a27cc

Branch: refs/heads/master
Commit: df4a27cc5cae8e251ba2a883bcc5f5ce9282f649
Parents: 5ed397b
Author: Devaraj K 
Authored: Wed Feb 1 12:55:11 2017 -0800
Committer: Shixiong Zhu 
Committed: Wed Feb 1 12:55:11 2017 -0800

--
 core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/df4a27cc/core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala
--
diff --git a/core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala 
b/core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala
index 201e619..073f7ed 100644
--- a/core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala
+++ b/core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala
@@ -185,6 +185,7 @@ private[spark] object UIData {
   })
   newTaskInfo.finishTime = taskInfo.finishTime
   newTaskInfo.failed = taskInfo.failed
+  newTaskInfo.killed = taskInfo.killed
   newTaskInfo
 }
   }





spark git commit: [SPARK-19432][CORE] Fix an unexpected failure when connecting timeout

2017-02-01 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 f94646415 -> 7c23bd49e


[SPARK-19432][CORE] Fix an unexpected failure when connecting timeout

## What changes were proposed in this pull request?

When a connection times out, `ask` may fail with a confusing message:

```
17/02/01 23:15:19 INFO Worker: Connecting to master ...
java.lang.IllegalArgumentException: requirement failed: TransportClient has not 
yet been set.
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.rpc.netty.RpcOutboxMessage.onTimeout(Outbox.scala:70)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:232)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:231)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
```

It's better to provide a meaningful message.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #16773 from zsxwing/connect-timeout.

(cherry picked from commit 8303e20c45153f91e585e230caa29b728a4d8c6c)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7c23bd49
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7c23bd49
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7c23bd49

Branch: refs/heads/branch-2.1
Commit: 7c23bd49e826fc2b7f132ffac2e55a71905abe96
Parents: f946464
Author: Shixiong Zhu 
Authored: Wed Feb 1 21:39:21 2017 -0800
Committer: Shixiong Zhu 
Committed: Wed Feb 1 21:39:30 2017 -0800

--
 core/src/main/scala/org/apache/spark/rpc/netty/Outbox.scala | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7c23bd49/core/src/main/scala/org/apache/spark/rpc/netty/Outbox.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rpc/netty/Outbox.scala 
b/core/src/main/scala/org/apache/spark/rpc/netty/Outbox.scala
index 6c090ad..a7b7f58 100644
--- a/core/src/main/scala/org/apache/spark/rpc/netty/Outbox.scala
+++ b/core/src/main/scala/org/apache/spark/rpc/netty/Outbox.scala
@@ -56,7 +56,7 @@ private[netty] case class RpcOutboxMessage(
 content: ByteBuffer,
 _onFailure: (Throwable) => Unit,
 _onSuccess: (TransportClient, ByteBuffer) => Unit)
-  extends OutboxMessage with RpcResponseCallback {
+  extends OutboxMessage with RpcResponseCallback with Logging {
 
   private var client: TransportClient = _
   private var requestId: Long = _
@@ -67,8 +67,11 @@ private[netty] case class RpcOutboxMessage(
   }
 
   def onTimeout(): Unit = {
-require(client != null, "TransportClient has not yet been set.")
-client.removeRpcRequest(requestId)
+if (client != null) {
+  client.removeRpcRequest(requestId)
+} else {
+  logError("Ask timeout before connecting successfully")
+}
   }
 
   override def onFailure(e: Throwable): Unit = {





spark git commit: [SPARK-19432][CORE] Fix an unexpected failure when connecting timeout

2017-02-01 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master b0985764f -> 8303e20c4


[SPARK-19432][CORE] Fix an unexpected failure when connecting timeout

## What changes were proposed in this pull request?

When a connection times out, `ask` may fail with a confusing message:

```
17/02/01 23:15:19 INFO Worker: Connecting to master ...
java.lang.IllegalArgumentException: requirement failed: TransportClient has not 
yet been set.
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.rpc.netty.RpcOutboxMessage.onTimeout(Outbox.scala:70)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:232)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:231)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
```

It's better to provide a meaningful message.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #16773 from zsxwing/connect-timeout.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8303e20c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8303e20c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8303e20c

Branch: refs/heads/master
Commit: 8303e20c45153f91e585e230caa29b728a4d8c6c
Parents: b098576
Author: Shixiong Zhu 
Authored: Wed Feb 1 21:39:21 2017 -0800
Committer: Shixiong Zhu 
Committed: Wed Feb 1 21:39:21 2017 -0800

--
 core/src/main/scala/org/apache/spark/rpc/netty/Outbox.scala | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8303e20c/core/src/main/scala/org/apache/spark/rpc/netty/Outbox.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rpc/netty/Outbox.scala 
b/core/src/main/scala/org/apache/spark/rpc/netty/Outbox.scala
index 6c090ad..a7b7f58 100644
--- a/core/src/main/scala/org/apache/spark/rpc/netty/Outbox.scala
+++ b/core/src/main/scala/org/apache/spark/rpc/netty/Outbox.scala
@@ -56,7 +56,7 @@ private[netty] case class RpcOutboxMessage(
 content: ByteBuffer,
 _onFailure: (Throwable) => Unit,
 _onSuccess: (TransportClient, ByteBuffer) => Unit)
-  extends OutboxMessage with RpcResponseCallback {
+  extends OutboxMessage with RpcResponseCallback with Logging {
 
   private var client: TransportClient = _
   private var requestId: Long = _
@@ -67,8 +67,11 @@ private[netty] case class RpcOutboxMessage(
   }
 
   def onTimeout(): Unit = {
-require(client != null, "TransportClient has not yet been set.")
-client.removeRpcRequest(requestId)
+if (client != null) {
+  client.removeRpcRequest(requestId)
+} else {
+  logError("Ask timeout before connecting successfully")
+}
   }
 
   override def onFailure(e: Throwable): Unit = {





spark git commit: [SPARK-19437] Rectify spark executor id in HeartbeatReceiverSuite.

2017-02-02 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 1d5d2a9d0 -> c86a57f4d


[SPARK-19437] Rectify spark executor id in HeartbeatReceiverSuite.

## What changes were proposed in this pull request?

In the current `HeartbeatReceiverSuite` code, the executor IDs are set as below:
```
  private val executorId1 = "executor-1"
  private val executorId2 = "executor-2"
```

The executorId is sent to the driver when the executor registers, as below:

```
test("expire dead hosts should kill executors with replacement (SPARK-8119)")  {
  ...
  fakeSchedulerBackend.driverEndpoint.askSync[Boolean](
  RegisterExecutor(executorId1, dummyExecutorEndpointRef1, "1.2.3.4", 0, 
Map.empty))
  ...
}
```

When `CoarseGrainedSchedulerBackend` receives `RegisterExecutor`, the executorId
is compared with `currentExecutorIdCounter` as below:
```
case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls)  =>
  if (executorDataMap.contains(executorId)) {
executorRef.send(RegisterExecutorFailed("Duplicate executor ID: " + 
executorId))
context.reply(true)
  } else {
  ...
  executorDataMap.put(executorId, data)
  if (currentExecutorIdCounter < executorId.toInt) {
currentExecutorIdCounter = executorId.toInt
  }
  ...
```

`executorId.toInt` will throw a NumberFormatException.

This unit test currently passes only because of `askWithRetry`: when the
exception is caught, the RPC is retried, and on the retry the executor is
already in `executorDataMap`, so the `if` branch is taken and `true` is returned.
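
A two-line sketch of the failure mode (illustrative only):

```scala
// A non-numeric executor ID like "executor-1" can never update currentExecutorIdCounter.
scala.util.Try("executor-1".toInt)  // Failure(java.lang.NumberFormatException: For input string: "executor-1")
"1".toInt                           // 1 (a purely numeric ID parses fine)
```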

**To fix**
Rectify executorId and replace `askWithRetry` with `askSync`, refer to 
https://github.com/apache/spark/pull/16690
## How was this patch tested?
This fix is for a unit test, so no additional test is needed.

Author: jinxing 

Closes #16779 from jinxing64/SPARK-19437.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c86a57f4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c86a57f4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c86a57f4

Branch: refs/heads/master
Commit: c86a57f4d1a39ab9602733a09d8fec13506cc6d4
Parents: 1d5d2a9
Author: jinxing 
Authored: Thu Feb 2 23:18:16 2017 -0800
Committer: Shixiong Zhu 
Committed: Thu Feb 2 23:18:16 2017 -0800

--
 .../apache/spark/HeartbeatReceiverSuite.scala   | 26 ++--
 1 file changed, 13 insertions(+), 13 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c86a57f4/core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala
--
diff --git a/core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala 
b/core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala
index 7b6a231..8891648 100644
--- a/core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala
+++ b/core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala
@@ -46,8 +46,8 @@ class HeartbeatReceiverSuite
   with PrivateMethodTester
   with LocalSparkContext {
 
-  private val executorId1 = "executor-1"
-  private val executorId2 = "executor-2"
+  private val executorId1 = "1"
+  private val executorId2 = "2"
 
   // Shared state that must be reset before and after each test
   private var scheduler: TaskSchedulerImpl = null
@@ -93,12 +93,12 @@ class HeartbeatReceiverSuite
 
   test("task scheduler is set correctly") {
 assert(heartbeatReceiver.scheduler === null)
-heartbeatReceiverRef.askWithRetry[Boolean](TaskSchedulerIsSet)
+heartbeatReceiverRef.askSync[Boolean](TaskSchedulerIsSet)
 assert(heartbeatReceiver.scheduler !== null)
   }
 
   test("normal heartbeat") {
-heartbeatReceiverRef.askWithRetry[Boolean](TaskSchedulerIsSet)
+heartbeatReceiverRef.askSync[Boolean](TaskSchedulerIsSet)
 addExecutorAndVerify(executorId1)
 addExecutorAndVerify(executorId2)
 triggerHeartbeat(executorId1, executorShouldReregister = false)
@@ -116,14 +116,14 @@ class HeartbeatReceiverSuite
   }
 
   test("reregister if heartbeat from unregistered executor") {
-heartbeatReceiverRef.askWithRetry[Boolean](TaskSchedulerIsSet)
+heartbeatReceiverRef.askSync[Boolean](TaskSchedulerIsSet)
 // Received heartbeat from unknown executor, so we ask it to re-register
 triggerHeartbeat(executorId1, executorShouldReregister = true)
 assert(getTrackedExecutors.isEmpty)
   }
 
   test("reregister if heartbeat from removed executor") {
-heartbeatReceiverRef.askWithRetry[Boolean](TaskSchedulerIsSet)
+heartbeatReceiverRef.askSync[Boolean](TaskSchedulerIsSet)
 addExecutorAndVerify(executorId1)
 addExecutorAndVerify(executorId2)
 // Remove the second executor but not the first
@@ -140,7 +140,7 @@ class HeartbeatReceiverSuite
 
   test("expire dead hosts") {
 val executorTimeout = heartbeatReceiver.invokePrivate(_executorTimeo

spark git commit: [SPARK-19407][SS] defaultFS is used FileSystem.get instead of getting it from uri scheme

2017-02-06 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master fab0d62a7 -> 7a0a630e0


[SPARK-19407][SS] defaultFS is used FileSystem.get instead of getting it from 
uri scheme

## What changes were proposed in this pull request?

```
Caused by: java.lang.IllegalArgumentException: Wrong FS: 
s3a://**/checkpoint/7b2231a3-d845-4740-bfa3-681850e5987f/metadata, 
expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
at 
org.apache.spark.sql.execution.streaming.StreamMetadata$.read(StreamMetadata.scala:51)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:100)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
at 
org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
```

This is easy to replicate on a Spark standalone cluster by providing a checkpoint
location whose URI scheme is anything other than "file://" and not overriding fs.defaultFS in the config.

Workaround: pass `--conf spark.hadoop.fs.defaultFS=s3a://somebucket`, or set it in
SparkConf or spark-defaults.conf.
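
The snippet below is a minimal sketch of the FileSystem resolution difference the fix relies on (the path is hypothetical, and it assumes an s3a filesystem implementation is on the classpath):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()  // fs.defaultFS is usually file:/// unless overridden
val metadataFile = new Path("s3a://somebucket/checkpoint/metadata")  // hypothetical path

// FileSystem.get ignores the path and returns the default FS (the local FS here),
// which is why fs.exists(metadataFile) then fails with "Wrong FS: ... expected: file:///".
val wrongFs = FileSystem.get(hadoopConf)

// Path.getFileSystem resolves the FS from the path's own scheme (s3a here),
// which is what the patch switches to.
val rightFs = metadataFile.getFileSystem(hadoopConf)

println(s"default FS: ${wrongFs.getUri}, path FS: ${rightFs.getUri}")
```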

## How was this patch tested?

Existing unit tests.

Author: uncleGen 

Closes #16815 from uncleGen/SPARK-19407.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7a0a630e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7a0a630e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7a0a630e

Branch: refs/heads/master
Commit: 7a0a630e0f699017c7d0214923cd4aa0227e62ff
Parents: fab0d62
Author: uncleGen 
Authored: Mon Feb 6 21:03:20 2017 -0800
Committer: Shixiong Zhu 
Committed: Mon Feb 6 21:03:20 2017 -0800

--
 .../apache/spark/sql/execution/streaming/StreamMetadata.scala| 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7a0a630e/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala
index 7807c9f..0bc54ea 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala
@@ -47,7 +47,7 @@ object StreamMetadata extends Logging {
 
   /** Read the metadata from file if it exists */
   def read(metadataFile: Path, hadoopConf: Configuration): Option[StreamMetadata] = {
-val fs = FileSystem.get(hadoopConf)
+val fs = metadataFile.getFileSystem(hadoopConf)
 if (fs.exists(metadataFile)) {
   var input: FSDataInputStream = null
   try {
@@ -72,7 +72,7 @@ object StreamMetadata extends Logging {
   hadoopConf: Configuration): Unit = {
 var output: FSDataOutputStream = null
 try {
-  val fs = FileSystem.get(hadoopConf)
+  val fs = metadataFile.getFileSystem(hadoopConf)
   output = fs.create(metadataFile)
   val writer = new OutputStreamWriter(output)
   Serialization.write(metadata, writer)





spark git commit: [SPARK-19407][SS] defaultFS is used FileSystem.get instead of getting it from uri scheme

2017-02-06 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 f55bd4c73 -> 62fab5bee


[SPARK-19407][SS] defaultFS is used FileSystem.get instead of getting it from 
uri scheme

## What changes were proposed in this pull request?

```
Caused by: java.lang.IllegalArgumentException: Wrong FS: 
s3a://**/checkpoint/7b2231a3-d845-4740-bfa3-681850e5987f/metadata, 
expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
at 
org.apache.spark.sql.execution.streaming.StreamMetadata$.read(StreamMetadata.scala:51)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:100)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
at 
org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
```

This is easy to replicate on a Spark standalone cluster by providing a checkpoint
location whose URI scheme is anything other than "file://" and not overriding fs.defaultFS in the config.

Workaround: pass `--conf spark.hadoop.fs.defaultFS=s3a://somebucket`, or set it in
SparkConf or spark-defaults.conf.

## How was this patch tested?

Existing unit tests.

Author: uncleGen 

Closes #16815 from uncleGen/SPARK-19407.

(cherry picked from commit 7a0a630e0f699017c7d0214923cd4aa0227e62ff)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/62fab5be
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/62fab5be
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/62fab5be

Branch: refs/heads/branch-2.1
Commit: 62fab5beee147c90d8b7f8092b4ee76ba611ee8e
Parents: f55bd4c
Author: uncleGen 
Authored: Mon Feb 6 21:03:20 2017 -0800
Committer: Shixiong Zhu 
Committed: Mon Feb 6 21:03:31 2017 -0800

--
 .../apache/spark/sql/execution/streaming/StreamMetadata.scala| 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/62fab5be/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala
index 7807c9f..0bc54ea 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala
@@ -47,7 +47,7 @@ object StreamMetadata extends Logging {
 
   /** Read the metadata from file if it exists */
   def read(metadataFile: Path, hadoopConf: Configuration): Option[StreamMetadata] = {
-val fs = FileSystem.get(hadoopConf)
+val fs = metadataFile.getFileSystem(hadoopConf)
 if (fs.exists(metadataFile)) {
   var input: FSDataInputStream = null
   try {
@@ -72,7 +72,7 @@ object StreamMetadata extends Logging {
   hadoopConf: Configuration): Unit = {
 var output: FSDataOutputStream = null
 try {
-  val fs = FileSystem.get(hadoopConf)
+  val fs = metadataFile.getFileSystem(hadoopConf)
   output = fs.create(metadataFile)
   val writer = new OutputStreamWriter(output)
   Serialization.write(metadata, writer)





[2/2] spark git commit: [SPARK-18682][SS] Batch Source for Kafka

2017-02-07 Thread zsxwing
[SPARK-18682][SS] Batch Source for Kafka

## What changes were proposed in this pull request?

Today, you can start a stream that reads from Kafka. However, given Kafka's
configurable retention period, sometimes you might just want to read all of the
data that is available now. As such, we should add a version that works with
`spark.read` as well.

The options should be the same as for the streaming Kafka source, with the following differences:
- `startingOffsets` should default to `earliest`, and should not allow `latest` (which would always be empty).
- `endingOffsets` should also be allowed and should default to `latest`; the same assign JSON format as for `startingOffsets` should also be accepted.

It would be really good if things like `.limit(n)` were enough to prevent all
the data from being read (this might just work).
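
A minimal batch-read sketch of the proposed API (the broker address and topic are hypothetical, and it assumes a `SparkSession` named `spark` plus the spark-sql-kafka-0-10 package on the classpath):

```scala
// Read everything currently retained in the topic as a batch DataFrame.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .option("startingOffsets", "earliest")  // default for batch
  .option("endingOffsets", "latest")      // default for batch; rejected in streaming queries
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show()
```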

## How was this patch tested?

KafkaRelationSuite was added for testing batch queries via KafkaUtils.

Author: Tyson Condie 

Closes #16686 from tcondie/SPARK-18682.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8df0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8df0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8df0

Branch: refs/heads/master
Commit: 8df03489aec0d68f7d930afdc4f7d50e0b41
Parents: 73ee739
Author: Tyson Condie 
Authored: Tue Feb 7 14:31:23 2017 -0800
Committer: Shixiong Zhu 
Committed: Tue Feb 7 14:31:23 2017 -0800

--
 .../sql/kafka010/CachedKafkaConsumer.scala  | 102 --
 .../spark/sql/kafka010/ConsumerStrategy.scala   |  84 +
 .../sql/kafka010/KafkaOffsetRangeLimit.scala|  51 +++
 .../spark/sql/kafka010/KafkaOffsetReader.scala  | 312 ++
 .../spark/sql/kafka010/KafkaRelation.scala  | 124 +++
 .../apache/spark/sql/kafka010/KafkaSource.scala | 323 +++
 .../sql/kafka010/KafkaSourceProvider.scala  | 262 ++-
 .../spark/sql/kafka010/KafkaSourceRDD.scala |  63 +++-
 .../spark/sql/kafka010/StartingOffsets.scala|  32 --
 .../spark/sql/kafka010/KafkaRelationSuite.scala | 233 +
 .../spark/sql/kafka010/KafkaSourceSuite.scala   |   3 +
 .../spark/sql/kafka010/KafkaTestUtils.scala |  21 +-
 12 files changed, 1180 insertions(+), 430 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8df0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
--
diff --git 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
index 3f396a7..15b2825 100644
--- 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
+++ 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
@@ -44,6 +44,9 @@ private[kafka010] case class CachedKafkaConsumer private(
 
   private var consumer = createConsumer
 
+  /** indicates whether this consumer is in use or not */
+  private var inuse = true
+
   /** Iterator to the already fetch data */
   private var fetchedData = 
ju.Collections.emptyIterator[ConsumerRecord[Array[Byte], Array[Byte]]]
   private var nextOffsetInFetchedData = UNKNOWN_OFFSET
@@ -57,6 +60,20 @@ private[kafka010] case class CachedKafkaConsumer private(
 c
   }
 
+  case class AvailableOffsetRange(earliest: Long, latest: Long)
+
+  /**
+   * Return the available offset range of the current partition. It's a pair 
of the earliest offset
+   * and the latest offset.
+   */
+  def getAvailableOffsetRange(): AvailableOffsetRange = {
+consumer.seekToBeginning(Set(topicPartition).asJava)
+val earliestOffset = consumer.position(topicPartition)
+consumer.seekToEnd(Set(topicPartition).asJava)
+val latestOffset = consumer.position(topicPartition)
+AvailableOffsetRange(earliestOffset, latestOffset)
+  }
+
   /**
* Get the record for the given offset if available. Otherwise it will 
either throw error
* (if failOnDataLoss = true), or return the next available offset within 
[offset, untilOffset),
@@ -107,9 +124,9 @@ private[kafka010] case class CachedKafkaConsumer private(
* `UNKNOWN_OFFSET`.
*/
   private def getEarliestAvailableOffsetBetween(offset: Long, untilOffset: 
Long): Long = {
-val (earliestOffset, latestOffset) = getAvailableOffsetRange()
-logWarning(s"Some data may be lost. Recovering from the earliest offset: 
$earliestOffset")
-if (offset >= latestOffset || earliestOffset >= untilOffset) {
+val range = getAvailableOffsetRange()
+logWarning(s"Some data may be lost. Recovering from the earliest offset: 
${range.earliest}")
+if (offset >= range.latest || range.e

[1/2] spark git commit: [SPARK-18682][SS] Batch Source for Kafka

2017-02-07 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 73ee73945 -> 8df03


http://git-wip-us.apache.org/repos/asf/spark/blob/8df0/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
--
diff --git 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
index 544fbc5..211c8a5 100644
--- 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
+++ 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
@@ -384,6 +384,9 @@ class KafkaSourceSuite extends KafkaSourceTest {
   }
 }
 
+// Specifying an ending offset
+testBadOptions("endingOffsets" -> "latest")("Ending offset not valid in 
streaming queries")
+
 // No strategy specified
 testBadOptions()("options must be specified", "subscribe", 
"subscribePattern")
 

http://git-wip-us.apache.org/repos/asf/spark/blob/8df0/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
--
diff --git 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
index 7e60410..2ce2760 100644
--- 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
+++ 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
@@ -50,7 +50,7 @@ import org.apache.spark.SparkConf
  *
  * The reason to put Kafka test utility class in src is to test Python related 
Kafka APIs.
  */
-class KafkaTestUtils extends Logging {
+class KafkaTestUtils(withBrokerProps: Map[String, Object] = Map.empty) extends Logging {
 
   // Zookeeper related configurations
   private val zkHost = "localhost"
@@ -249,6 +249,24 @@ class KafkaTestUtils extends Logging {
 offsets
   }
 
+  def cleanupLogs(): Unit = {
+server.logManager.cleanupLogs()
+  }
+
+  def getEarliestOffsets(topics: Set[String]): Map[TopicPartition, Long] = {
+val kc = new KafkaConsumer[String, String](consumerConfiguration)
+logInfo("Created consumer to get earliest offsets")
+kc.subscribe(topics.asJavaCollection)
+kc.poll(0)
+val partitions = kc.assignment()
+kc.pause(partitions)
+kc.seekToBeginning(partitions)
+val offsets = partitions.asScala.map(p => p -> kc.position(p)).toMap
+kc.close()
+logInfo("Closed consumer to get earliest offsets")
+offsets
+  }
+
   def getLatestOffsets(topics: Set[String]): Map[TopicPartition, Long] = {
 val kc = new KafkaConsumer[String, String](consumerConfiguration)
 logInfo("Created consumer to get latest offsets")
@@ -274,6 +292,7 @@ class KafkaTestUtils extends Logging {
 props.put("log.flush.interval.messages", "1")
 props.put("replica.socket.timeout.ms", "1500")
 props.put("delete.topic.enable", "true")
+props.putAll(withBrokerProps.asJava)
 props
   }
 





[2/2] spark git commit: [SPARK-18682][SS] Batch Source for Kafka

2017-02-07 Thread zsxwing
[SPARK-18682][SS] Batch Source for Kafka

Today, you can start a stream that reads from Kafka. However, given Kafka's
configurable retention period, sometimes you might just want to read all of the
data that is available now. As such, we should add a version that works with
`spark.read` as well.

The options should be the same as for the streaming Kafka source, with the following differences:
- `startingOffsets` should default to `earliest`, and should not allow `latest` (which would always be empty).
- `endingOffsets` should also be allowed and should default to `latest`; the same assign JSON format as for `startingOffsets` should also be accepted.

It would be really good if things like `.limit(n)` were enough to prevent all
the data from being read (this might just work).

KafkaRelationSuite was added for testing batch queries via KafkaUtils.

Author: Tyson Condie 

Closes #16686 from tcondie/SPARK-18682.

(cherry picked from commit 8df03489aec0d68f7d930afdc4f7d50e0b41)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e642a07d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e642a07d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e642a07d

Branch: refs/heads/branch-2.1
Commit: e642a07d57798f98b25ba08ed7ae3abe0f597941
Parents: dd1abef
Author: Tyson Condie 
Authored: Tue Feb 7 14:31:23 2017 -0800
Committer: Shixiong Zhu 
Committed: Tue Feb 7 14:44:58 2017 -0800

--
 .../sql/kafka010/CachedKafkaConsumer.scala  | 102 --
 .../spark/sql/kafka010/ConsumerStrategy.scala   |  84 +
 .../sql/kafka010/KafkaOffsetRangeLimit.scala|  51 +++
 .../spark/sql/kafka010/KafkaOffsetReader.scala  | 312 ++
 .../spark/sql/kafka010/KafkaRelation.scala  | 124 +++
 .../apache/spark/sql/kafka010/KafkaSource.scala | 323 +++
 .../sql/kafka010/KafkaSourceProvider.scala  | 262 ++-
 .../spark/sql/kafka010/KafkaSourceRDD.scala |  63 +++-
 .../spark/sql/kafka010/StartingOffsets.scala|  32 --
 .../spark/sql/kafka010/KafkaRelationSuite.scala | 233 +
 .../spark/sql/kafka010/KafkaSourceSuite.scala   |   3 +
 .../spark/sql/kafka010/KafkaTestUtils.scala |  21 +-
 12 files changed, 1180 insertions(+), 430 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e642a07d/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
--
diff --git 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
index 3f396a7..15b2825 100644
--- 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
+++ 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
@@ -44,6 +44,9 @@ private[kafka010] case class CachedKafkaConsumer private(
 
   private var consumer = createConsumer
 
+  /** indicates whether this consumer is in use or not */
+  private var inuse = true
+
   /** Iterator to the already fetch data */
   private var fetchedData = 
ju.Collections.emptyIterator[ConsumerRecord[Array[Byte], Array[Byte]]]
   private var nextOffsetInFetchedData = UNKNOWN_OFFSET
@@ -57,6 +60,20 @@ private[kafka010] case class CachedKafkaConsumer private(
 c
   }
 
+  case class AvailableOffsetRange(earliest: Long, latest: Long)
+
+  /**
+   * Return the available offset range of the current partition. It's a pair 
of the earliest offset
+   * and the latest offset.
+   */
+  def getAvailableOffsetRange(): AvailableOffsetRange = {
+consumer.seekToBeginning(Set(topicPartition).asJava)
+val earliestOffset = consumer.position(topicPartition)
+consumer.seekToEnd(Set(topicPartition).asJava)
+val latestOffset = consumer.position(topicPartition)
+AvailableOffsetRange(earliestOffset, latestOffset)
+  }
+
   /**
* Get the record for the given offset if available. Otherwise it will 
either throw error
* (if failOnDataLoss = true), or return the next available offset within 
[offset, untilOffset),
@@ -107,9 +124,9 @@ private[kafka010] case class CachedKafkaConsumer private(
* `UNKNOWN_OFFSET`.
*/
   private def getEarliestAvailableOffsetBetween(offset: Long, untilOffset: 
Long): Long = {
-val (earliestOffset, latestOffset) = getAvailableOffsetRange()
-logWarning(s"Some data may be lost. Recovering from the earliest offset: 
$earliestOffset")
-if (offset >= latestOffset || earliestOffset >= untilOffset) {
+val range = getAvailableOffsetRange()
+logWarning(s"Some data may be lost. Recovering from the earliest offset: 
${range.earliest}")
+if (offset >= rang

[1/2] spark git commit: [SPARK-18682][SS] Batch Source for Kafka

2017-02-07 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 dd1abef13 -> e642a07d5


http://git-wip-us.apache.org/repos/asf/spark/blob/e642a07d/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
--
diff --git 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
index 544fbc5..211c8a5 100644
--- 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
+++ 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
@@ -384,6 +384,9 @@ class KafkaSourceSuite extends KafkaSourceTest {
   }
 }
 
+// Specifying an ending offset
+testBadOptions("endingOffsets" -> "latest")("Ending offset not valid in 
streaming queries")
+
 // No strategy specified
 testBadOptions()("options must be specified", "subscribe", 
"subscribePattern")
 

http://git-wip-us.apache.org/repos/asf/spark/blob/e642a07d/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
--
diff --git 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
index fd1689a..c2cbd86 100644
--- 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
+++ 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
@@ -50,7 +50,7 @@ import org.apache.spark.SparkConf
  *
  * The reason to put Kafka test utility class in src is to test Python related 
Kafka APIs.
  */
-class KafkaTestUtils extends Logging {
+class KafkaTestUtils(withBrokerProps: Map[String, Object] = Map.empty) extends Logging {
 
   // Zookeeper related configurations
   private val zkHost = "localhost"
@@ -238,6 +238,24 @@ class KafkaTestUtils extends Logging {
 offsets
   }
 
+  def cleanupLogs(): Unit = {
+server.logManager.cleanupLogs()
+  }
+
+  def getEarliestOffsets(topics: Set[String]): Map[TopicPartition, Long] = {
+val kc = new KafkaConsumer[String, String](consumerConfiguration)
+logInfo("Created consumer to get earliest offsets")
+kc.subscribe(topics.asJavaCollection)
+kc.poll(0)
+val partitions = kc.assignment()
+kc.pause(partitions)
+kc.seekToBeginning(partitions)
+val offsets = partitions.asScala.map(p => p -> kc.position(p)).toMap
+kc.close()
+logInfo("Closed consumer to get earliest offsets")
+offsets
+  }
+
   def getLatestOffsets(topics: Set[String]): Map[TopicPartition, Long] = {
 val kc = new KafkaConsumer[String, String](consumerConfiguration)
 logInfo("Created consumer to get latest offsets")
@@ -263,6 +281,7 @@ class KafkaTestUtils extends Logging {
 props.put("log.flush.interval.messages", "1")
 props.put("replica.socket.timeout.ms", "1500")
 props.put("delete.topic.enable", "true")
+props.putAll(withBrokerProps.asJava)
 props
   }
 





spark git commit: [SPARK-19413][SS] MapGroupsWithState for arbitrary stateful operations

2017-02-07 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master e33aaa2ac -> aeb80348d


[SPARK-19413][SS] MapGroupsWithState for arbitrary stateful operations

## What changes were proposed in this pull request?

`mapGroupsWithState` is a new API for arbitrary stateful operations in 
Structured Streaming, similar to `DStream.mapWithState`

*Requirements*
- Users should be able to specify a function that can do the following
  - Access the input row corresponding to a key
  - Access the previous state corresponding to a key
  - Optionally, update or remove the state
  - Output any number of new rows (or none at all)

*Proposed API*
```
// ---------- New methods on KeyValueGroupedDataset ----------
class KeyValueGroupedDataset[K, V] {
  // Scala friendly
  def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => U)
  def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => Iterator[U])

  // Java friendly
  def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
  def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
}

// ---------- New Java-friendly function classes ----------
public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
  R call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
}
public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable {
  Iterator<R> call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
}

// ---------- Wrapper class for state data ----------
trait KeyedState[S] {
  def exists(): Boolean
  def get(): S                  // throws an exception if the state does not exist
  def getOption(): Option[S]
  def update(newState: S): Unit
  def remove(): Unit            // exists() will be false after this
}
```

Key Semantics of the KeyedState class
- The state can be null.
- If state.remove() is called, then state.exists() will return false, and getOption will return None.
- After state.update(newState) is called, state.exists() will return true, and getOption will return Some(...).
- None of the operations are thread-safe. This is to avoid memory barriers.

*Usage*
```
val stateFunc = (word: String, words: Iterator[String], runningCount: KeyedState[Long]) => {
  val newCount = words.size + runningCount.getOption.getOrElse(0L)
  runningCount.update(newCount)
  (word, newCount)
}

dataset                                                 // type is Dataset[String]
  .groupByKey[String](w => w)                           // generates KeyValueGroupedDataset[String, String]
  .mapGroupsWithState[Long, (String, Long)](stateFunc)  // returns Dataset[(String, Long)]
```
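A short, hedged sketch (not part of this patch; the user/click names are made up) that exercises the state semantics listed above: `getOption` before the first `update`, `update` to persist, and `remove` to drop the state.

```
import org.apache.spark.sql.KeyedState

val sessionFunc = (user: String, clicks: Iterator[String], state: KeyedState[Long]) => {
  val total = state.getOption.getOrElse(0L) + clicks.size   // None until update() has run
  if (total > 100L) {
    state.remove()           // exists() is false from here on; getOption returns None
    (user, total, "evicted")
  } else {
    state.update(total)      // exists() is true from here on; getOption returns Some(total)
    (user, total, "active")
  }
}
```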

## How was this patch tested?
New unit tests.

Author: Tathagata Das 

Closes #16758 from tdas/mapWithState.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aeb80348
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aeb80348
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aeb80348

Branch: refs/heads/master
Commit: aeb80348dd40c66b84bbc5cfe60d716fbce25acb
Parents: e33aaa2
Author: Tathagata Das 
Authored: Tue Feb 7 20:21:00 2017 -0800
Committer: Shixiong Zhu 
Committed: Tue Feb 7 20:21:00 2017 -0800

--
 .../analysis/UnsupportedOperationChecker.scala  |  11 +-
 .../sql/catalyst/plans/logical/object.scala |  49 +++
 .../analysis/UnsupportedOperationsSuite.scala   |  24 +-
 .../FlatMapGroupsWithStateFunction.java |  38 +++
 .../function/MapGroupsWithStateFunction.java|  38 +++
 .../spark/sql/KeyValueGroupedDataset.scala  | 113 +++
 .../scala/org/apache/spark/sql/KeyedState.scala | 142 
 .../spark/sql/execution/SparkStrategies.scala   |  21 +-
 .../apache/spark/sql/execution/objects.scala|  22 ++
 .../streaming/IncrementalExecution.scala|  19 +-
 .../execution/streaming/KeyedStateImpl.scala|  80 +
 .../execution/streaming/ProgressReporter.scala  |   2 +-
 .../execution/streaming/StatefulAggregate.scala | 237 -
 .../state/HDFSBackedStateStoreProvider.scala|  19 ++
 .../execution/streaming/state/StateStore.scala  |   5 +
 .../sql/execution/streaming/state/package.scala |  11 +-
 .../execution/streaming/statefulOperators.scala | 323 ++
 .../org/apache/spark/sql/JavaDatasetSuite.java  |  32 ++
 .../sql/streaming/MapGroupsWithStateSuite.scala | 335 +++
 19 files changed, 1272 insertions(+), 249 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/aeb80348/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Unsupport

spark git commit: [SPARK-19499][SS] Add more notes in the comments of Sink.addBatch()

2017-02-07 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 e642a07d5 -> 706d6c154


[SPARK-19499][SS] Add more notes in the comments of Sink.addBatch()

## What changes were proposed in this pull request?

The addBatch method in the Sink trait is supposed to be a synchronous method, to 
coordinate with the fault-tolerance design in StreamExecution (unlike the 
compute() method in DStream).

We need to add more notes in the comments of this method to remind developers.

## How was this patch tested?

existing tests

Author: CodingCat 

Closes #16840 from CodingCat/SPARK-19499.

(cherry picked from commit d4cd975718716be11a42ce92a47c45be1a46bd60)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/706d6c15
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/706d6c15
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/706d6c15

Branch: refs/heads/branch-2.1
Commit: 706d6c154d2471c00253bf9b0c4e867752f841fe
Parents: e642a07
Author: CodingCat 
Authored: Tue Feb 7 20:25:18 2017 -0800
Committer: Shixiong Zhu 
Committed: Tue Feb 7 20:25:25 2017 -0800

--
 .../scala/org/apache/spark/sql/execution/streaming/Sink.scala   | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/706d6c15/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala
index 2571b59..d10cd30 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala
@@ -31,8 +31,11 @@ trait Sink {
* this method is called more than once with the same batchId (which will 
happen in the case of
* failures), then `data` should only be added once.
*
-   * Note: You cannot apply any operators on `data` except consuming it (e.g., 
`collect/foreach`).
+   * Note 1: You cannot apply any operators on `data` except consuming it 
(e.g., `collect/foreach`).
* Otherwise, you may get a wrong result.
+   *
+   * Note 2: The method is supposed to be executed synchronously, i.e. the 
method should only return
+   * after data is consumed by sink successfully.
*/
   def addBatch(batchId: Long, data: DataFrame): Unit
 }
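
As a hedged illustration of the two notes above (a sketch, not part of this patch), a toy sink that only consumes `data` and stays idempotent across retried batch ids could look like this:

```
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class ConsoleishSink extends Sink {
  private var lastCommittedBatchId = -1L

  override def addBatch(batchId: Long, data: DataFrame): Unit = synchronized {
    if (batchId > lastCommittedBatchId) {
      data.collect().foreach(println)   // only consuming the DataFrame, no extra operators
      lastCommittedBatchId = batchId    // recorded only after the data is fully consumed
    }
    // a repeated batchId (possible after a failure) is skipped, so data is added only once
  }
}
```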





spark git commit: [SPARK-19499][SS] Add more notes in the comments of Sink.addBatch()

2017-02-07 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master aeb80348d -> d4cd97571


[SPARK-19499][SS] Add more notes in the comments of Sink.addBatch()

## What changes were proposed in this pull request?

The addBatch method in the Sink trait is supposed to be a synchronous method, to 
coordinate with the fault-tolerance design in StreamExecution (unlike the 
compute() method in DStream).

We need to add more notes in the comments of this method to remind developers.

## How was this patch tested?

existing tests

Author: CodingCat 

Closes #16840 from CodingCat/SPARK-19499.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d4cd9757
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d4cd9757
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d4cd9757

Branch: refs/heads/master
Commit: d4cd975718716be11a42ce92a47c45be1a46bd60
Parents: aeb8034
Author: CodingCat 
Authored: Tue Feb 7 20:25:18 2017 -0800
Committer: Shixiong Zhu 
Committed: Tue Feb 7 20:25:18 2017 -0800

--
 .../scala/org/apache/spark/sql/execution/streaming/Sink.scala   | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d4cd9757/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala
index 2571b59..d10cd30 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala
@@ -31,8 +31,11 @@ trait Sink {
* this method is called more than once with the same batchId (which will 
happen in the case of
* failures), then `data` should only be added once.
*
-   * Note: You cannot apply any operators on `data` except consuming it (e.g., 
`collect/foreach`).
+   * Note 1: You cannot apply any operators on `data` except consuming it 
(e.g., `collect/foreach`).
* Otherwise, you may get a wrong result.
+   *
+   * Note 2: The method is supposed to be executed synchronously, i.e. the 
method should only return
+   * after data is consumed by sink successfully.
*/
   def addBatch(batchId: Long, data: DataFrame): Unit
 }





spark git commit: [MINOR][DOC] Remove parenthesis in readStream() on kafka structured streaming doc

2017-02-07 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 266c1e730 -> 5a0569ce6


[MINOR][DOC] Remove parenthesis in readStream() on kafka structured streaming 
doc

There is a typo in 
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-stream :
Python example 1 uses `readStream()` instead of `readStream`.

Just removed the parentheses.

Author: manugarri 

Closes #16836 from manugarri/fix_kafka_python_doc.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5a0569ce
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5a0569ce
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5a0569ce

Branch: refs/heads/master
Commit: 5a0569ce693c635c5fa12b2de33ed3643ce888e3
Parents: 266c1e7
Author: manugarri 
Authored: Tue Feb 7 21:45:33 2017 -0800
Committer: Shixiong Zhu 
Committed: Tue Feb 7 21:45:57 2017 -0800

--
 docs/structured-streaming-kafka-integration.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5a0569ce/docs/structured-streaming-kafka-integration.md
--
diff --git a/docs/structured-streaming-kafka-integration.md 
b/docs/structured-streaming-kafka-integration.md
index 9b82e8e..8b2f51a 100644
--- a/docs/structured-streaming-kafka-integration.md
+++ b/docs/structured-streaming-kafka-integration.md
@@ -90,7 +90,7 @@ ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
 
 # Subscribe to 1 topic
 ds1 = spark
-  .readStream()
+  .readStream
   .format("kafka")
   .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
   .option("subscribe", "topic1")
@@ -108,7 +108,7 @@ ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS 
STRING)")
 
 # Subscribe to a pattern
 ds3 = spark
-  .readStream()
+  .readStream
   .format("kafka")
   .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
   .option("subscribePattern", "topic.*")





spark git commit: [MINOR][DOC] Remove parenthesis in readStream() on kafka structured streaming doc

2017-02-07 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 706d6c154 -> 4d040297f


[MINOR][DOC] Remove parenthesis in readStream() on kafka structured streaming 
doc

There is a typo in 
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-stream :
Python example 1 uses `readStream()` instead of `readStream`.

Just removed the parentheses.

Author: manugarri 

Closes #16836 from manugarri/fix_kafka_python_doc.

(cherry picked from commit 5a0569ce693c635c5fa12b2de33ed3643ce888e3)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4d040297
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4d040297
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4d040297

Branch: refs/heads/branch-2.1
Commit: 4d040297f55243703463ea71d5302bb46ea0bf3f
Parents: 706d6c1
Author: manugarri 
Authored: Tue Feb 7 21:45:33 2017 -0800
Committer: Shixiong Zhu 
Committed: Tue Feb 7 21:46:41 2017 -0800

--
 docs/structured-streaming-kafka-integration.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4d040297/docs/structured-streaming-kafka-integration.md
--
diff --git a/docs/structured-streaming-kafka-integration.md 
b/docs/structured-streaming-kafka-integration.md
index 2458bb5..208845f 100644
--- a/docs/structured-streaming-kafka-integration.md
+++ b/docs/structured-streaming-kafka-integration.md
@@ -90,7 +90,7 @@ ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
 
 # Subscribe to 1 topic
 ds1 = spark
-  .readStream()
+  .readStream
   .format("kafka")
   .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
   .option("subscribe", "topic1")
@@ -108,7 +108,7 @@ ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS 
STRING)")
 
 # Subscribe to a pattern
 ds3 = spark
-  .readStream()
+  .readStream
   .format("kafka")
   .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
   .option("subscribePattern", "topic.*")





spark git commit: [SPARK-19413][SS] MapGroupsWithState for arbitrary stateful operations for branch-2.1

2017-02-08 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 71b6eacf7 -> 502c927b8


[SPARK-19413][SS] MapGroupsWithState for arbitrary stateful operations for 
branch-2.1

This is a follow-up PR to merge #16758 into the Spark 2.1 branch.

## What changes were proposed in this pull request?

`mapGroupsWithState` is a new API for arbitrary stateful operations in 
Structured Streaming, similar to `DStream.mapWithState`

*Requirements*
- Users should be able to specify a function that can do the following
  - Access the input row corresponding to a key
  - Access the previous state corresponding to a key
  - Optionally, update or remove the state
  - Output any number of new rows (or none at all)

*Proposed API*
```
// ---------- New methods on KeyValueGroupedDataset ----------
class KeyValueGroupedDataset[K, V] {
  // Scala friendly
  def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => U)
  def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => Iterator[U])

  // Java friendly
  def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
  def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
}

// ---------- New Java-friendly function classes ----------
public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
  R call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
}
public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable {
  Iterator<R> call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
}

// ---------- Wrapper class for state data ----------
trait KeyedState[S] {
  def exists(): Boolean
  def get(): S                  // throws an exception if the state does not exist
  def getOption(): Option[S]
  def update(newState: S): Unit
  def remove(): Unit            // exists() will be false after this
}
```

Key Semantics of the KeyedState class
- The state can be null.
- If state.remove() is called, then state.exists() will return false, and getOption will return None.
- After state.update(newState) is called, state.exists() will return true, and getOption will return Some(...).
- None of the operations are thread-safe. This is to avoid memory barriers.

*Usage*
```
val stateFunc = (word: String, words: Iterator[String], runningCount: KeyedState[Long]) => {
  val newCount = words.size + runningCount.getOption.getOrElse(0L)
  runningCount.update(newCount)
  (word, newCount)
}

dataset                                                 // type is Dataset[String]
  .groupByKey[String](w => w)                           // generates KeyValueGroupedDataset[String, String]
  .mapGroupsWithState[Long, (String, Long)](stateFunc)  // returns Dataset[(String, Long)]
```

## How was this patch tested?
New unit tests.

Author: Tathagata Das 

Closes #16850 from tdas/mapWithState-branch-2.1.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/502c927b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/502c927b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/502c927b

Branch: refs/heads/branch-2.1
Commit: 502c927b8c8a99ef2adf4e6e1d7a6d9232d45ef5
Parents: 71b6eac
Author: Tathagata Das 
Authored: Wed Feb 8 11:33:59 2017 -0800
Committer: Shixiong Zhu 
Committed: Wed Feb 8 11:33:59 2017 -0800

--
 .../analysis/UnsupportedOperationChecker.scala  |  11 +-
 .../sql/catalyst/plans/logical/object.scala |  49 +++
 .../analysis/UnsupportedOperationsSuite.scala   |  24 +-
 .../FlatMapGroupsWithStateFunction.java |  38 +++
 .../function/MapGroupsWithStateFunction.java|  38 +++
 .../spark/sql/KeyValueGroupedDataset.scala  | 113 +++
 .../scala/org/apache/spark/sql/KeyedState.scala | 142 
 .../spark/sql/execution/SparkStrategies.scala   |  21 +-
 .../apache/spark/sql/execution/objects.scala|  22 ++
 .../streaming/IncrementalExecution.scala|  19 +-
 .../execution/streaming/KeyedStateImpl.scala|  80 +
 .../execution/streaming/ProgressReporter.scala  |   2 +-
 .../execution/streaming/StatefulAggregate.scala | 237 -
 .../state/HDFSBackedStateStoreProvider.scala|  19 ++
 .../execution/streaming/state/StateStore.scala  |   5 +
 .../sql/execution/streaming/state/package.scala |  11 +-
 .../execution/streaming/statefulOperators.scala | 323 ++
 .../org/apache/spark/sql/JavaDatasetSuite.java  |  32 ++
 .../sql/streaming/MapGroupsWithStateSuite.scala | 335 +++
 19 files changed, 1272 insertions(+), 249 deletions(-)
--


http://git-wip-us.apache.org/rep

spark git commit: [SPARK-19564][SPARK-19559][SS][KAFKA] KafkaOffsetReader's consumers should not be in the same group

2017-02-12 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master bc0a0e639 -> 2bdbc8705


[SPARK-19564][SPARK-19559][SS][KAFKA] KafkaOffsetReader's consumers should not 
be in the same group

## What changes were proposed in this pull request?

In `KafkaOffsetReader`, when an error occurs, we abort the existing consumer and 
create a new consumer. In our current implementation, the first consumer and 
the second consumer end up in the same group (which leads to SPARK-19559), 
**_violating our intention of the two consumers not being in the same group._**

The cause is that, in our current implementation, the first consumer is created 
before `groupId` and `nextId` are initialized in the constructor. Even though 
`groupId` and `nextId` are bumped while that first consumer is being created, 
their declarations further down the constructor re-initialize them to the default 
values, so the second consumer later generates the same group id as the first one.

We should make sure that `groupId` and `nextId` are initialized before any 
consumer is created.
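
A standalone, hedged sketch (not Spark's code; the names are illustrative) of the initialization-order pitfall described above: a statement in the class body that runs before the `var` declarations sees their JVM defaults, and the declarations then overwrite whatever it assigned.

```
class Reader(prefix: String) {
  // Runs first: at this point groupId is null and nextId is 0 (JVM defaults).
  private val firstGroup = nextGroupId()   // assigns groupId = "<prefix>-0", nextId = 1

  // These declarations execute afterwards and reset both fields ...
  private var groupId: String = null
  private var nextId = 0

  // ... so the next call starts from 0 again and reuses the same group id.
  private def nextGroupId(): String = {
    groupId = prefix + "-" + nextId
    nextId += 1
    groupId
  }

  def secondGroup(): String = nextGroupId()
}

// new Reader("g").secondGroup() returns "g-0", the same group the first call produced;
// declaring the vars above any consumer creation (as this patch does) avoids the clash.
```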

## How was this patch tested?

Ran `KafkaSourceSuite` 100 times; all passed

Author: Liwei Lin 

Closes #16902 from lw-lin/SPARK-19564-.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2bdbc870
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2bdbc870
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2bdbc870

Branch: refs/heads/master
Commit: 2bdbc87052389ff69404347fbc69457132dbcafd
Parents: bc0a0e6
Author: Liwei Lin 
Authored: Sun Feb 12 23:00:22 2017 -0800
Committer: Shixiong Zhu 
Committed: Sun Feb 12 23:00:22 2017 -0800

--
 .../apache/spark/sql/kafka010/KafkaOffsetReader.scala| 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2bdbc870/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
--
diff --git 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
index 6b2fb3c..2696d6f 100644
--- 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
+++ 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
@@ -65,6 +65,13 @@ private[kafka010] class KafkaOffsetReader(
   val execContext = ExecutionContext.fromExecutorService(kafkaReaderThread)
 
   /**
+   * Place [[groupId]] and [[nextId]] here so that they are initialized before 
any consumer is
+   * created -- see SPARK-19564.
+   */
+  private var groupId: String = null
+  private var nextId = 0
+
+  /**
* A KafkaConsumer used in the driver to query the latest Kafka offsets. 
This only queries the
* offsets and never commits them.
*/
@@ -76,10 +83,6 @@ private[kafka010] class KafkaOffsetReader(
   private val offsetFetchAttemptIntervalMs =
 readerOptions.getOrElse("fetchOffset.retryIntervalMs", "1000").toLong
 
-  private var groupId: String = null
-
-  private var nextId = 0
-
   private def nextGroupId(): String = {
 groupId = driverGroupIdPrefix + "-" + nextId
 nextId += 1





spark git commit: [SPARK-19564][SPARK-19559][SS][KAFKA] KafkaOffsetReader's consumers should not be in the same group

2017-02-12 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 06e77e009 -> fe4fcc570


[SPARK-19564][SPARK-19559][SS][KAFKA] KafkaOffsetReader's consumers should not 
be in the same group

## What changes were proposed in this pull request?

In `KafkaOffsetReader`, when an error occurs, we abort the existing consumer and 
create a new consumer. In our current implementation, the first consumer and 
the second consumer end up in the same group (which leads to SPARK-19559), 
**_violating our intention of the two consumers not being in the same group._**

The cause is that, in our current implementation, the first consumer is created 
before `groupId` and `nextId` are initialized in the constructor. Even though 
`groupId` and `nextId` are bumped while that first consumer is being created, 
their declarations further down the constructor re-initialize them to the default 
values, so the second consumer later generates the same group id as the first one.

We should make sure that `groupId` and `nextId` are initialized before any 
consumer is created.

## How was this patch tested?

Ran `KafkaSourceSuite` 100 times; all passed

Author: Liwei Lin 

Closes #16902 from lw-lin/SPARK-19564-.

(cherry picked from commit 2bdbc87052389ff69404347fbc69457132dbcafd)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fe4fcc57
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fe4fcc57
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fe4fcc57

Branch: refs/heads/branch-2.1
Commit: fe4fcc5701cbd3f2e698e00f1cc7d49d5c7c702b
Parents: 06e77e0
Author: Liwei Lin 
Authored: Sun Feb 12 23:00:22 2017 -0800
Committer: Shixiong Zhu 
Committed: Sun Feb 12 23:00:30 2017 -0800

--
 .../apache/spark/sql/kafka010/KafkaOffsetReader.scala| 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fe4fcc57/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
--
diff --git 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
index 6b2fb3c..2696d6f 100644
--- 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
+++ 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
@@ -65,6 +65,13 @@ private[kafka010] class KafkaOffsetReader(
   val execContext = ExecutionContext.fromExecutorService(kafkaReaderThread)
 
   /**
+   * Place [[groupId]] and [[nextId]] here so that they are initialized before 
any consumer is
+   * created -- see SPARK-19564.
+   */
+  private var groupId: String = null
+  private var nextId = 0
+
+  /**
* A KafkaConsumer used in the driver to query the latest Kafka offsets. 
This only queries the
* offsets and never commits them.
*/
@@ -76,10 +83,6 @@ private[kafka010] class KafkaOffsetReader(
   private val offsetFetchAttemptIntervalMs =
 readerOptions.getOrElse("fetchOffset.retryIntervalMs", "1000").toLong
 
-  private var groupId: String = null
-
-  private var nextId = 0
-
   private def nextGroupId(): String = {
 groupId = driverGroupIdPrefix + "-" + nextId
 nextId += 1





spark git commit: [SPARK-17714][CORE][TEST-MAVEN][TEST-HADOOP2.6] Avoid using ExecutorClassLoader to load Netty generated classes

2017-02-13 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 3dbff9be0 -> 905fdf0c2


[SPARK-17714][CORE][TEST-MAVEN][TEST-HADOOP2.6] Avoid using ExecutorClassLoader 
to load Netty generated classes

## What changes were proposed in this pull request?

Netty's `MessageToMessageEncoder` uses 
[Javassist](https://github.com/netty/netty/blob/91a0bdc17a8298437d6de08a8958d753799bd4a6/common/src/main/java/io/netty/util/internal/JavassistTypeParameterMatcherGenerator.java#L62)
 to generate a matcher class, and the implementation calls `Class.forName` to 
check whether this class has already been generated. If `MessageEncoder` or 
`MessageDecoder` is created in `ExecutorClassLoader.findClass`, it will cause 
`ClassCircularityError`. This is because loading this Netty-generated class 
calls `ExecutorClassLoader.findClass` to search for it, and 
`ExecutorClassLoader` tries to use RPC to load it, which triggers loading the 
not-yet-existing matcher class again. The JVM reports `ClassCircularityError` to 
prevent such infinite recursion.
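
A standalone, hedged sketch (illustrative only, not Spark's or Netty's code) of the recursion described above: a class loader whose `findClass` routes the class it is currently defining back through itself, which the JVM cuts short by reporting the circular load instead of recursing forever.

```
// The lookup inside findClass stands in for Netty/Javassist calling Class.forName
// for the generated matcher class while ExecutorClassLoader is still resolving it.
class RecursiveLoader extends ClassLoader {
  override protected def findClass(name: String): Class[_] =
    Class.forName(name, false, this)   // re-enters this same loader for the same name
}

// new RecursiveLoader().loadClass("com.example.DoesNotExist")
// fails fast with the circularity error described above rather than looping.
```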

# Why it only happens in Maven builds

It's because Maven and SBT have different class loader trees. The Maven build 
sets a URLClassLoader as the current context class loader to run the tests, 
which exposes this issue. The class loader tree is as follows:

```
bootstrap class loader ---- ... ---- REPL class loader ---- ExecutorClassLoader
  |
  |
URLClassLoader
```

The SBT build uses the bootstrap class loader directly, and 
`ReplSuite.test("propagation of local properties")` is the first test in 
ReplSuite, which happens to load 
`io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher`
 into the bootstrap class loader. (Note: in the Maven build, it's loaded into the 
URLClassLoader, so it cannot be found in ExecutorClassLoader.) This issue can 
be reproduced in SBT as well. Here are the steps to reproduce it:
- Enable `hadoop.caller.context.enabled`.
- Replace `Class.forName` with `Utils.classForName` in `object CallerContext`.
- Ignore `ReplSuite.test("propagation of local properties")`.
- Run `ReplSuite` using SBT.

This PR just creates a singleton MessageEncoder and MessageDecoder and makes 
sure they are created before switching to ExecutorClassLoader. TransportContext 
will be created when creating RpcEnv and that happens before creating 
ExecutorClassLoader.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #16859 from zsxwing/SPARK-17714.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/905fdf0c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/905fdf0c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/905fdf0c

Branch: refs/heads/master
Commit: 905fdf0c243e1776c54c01a25b17878361400225
Parents: 3dbff9b
Author: Shixiong Zhu 
Authored: Mon Feb 13 12:03:36 2017 -0800
Committer: Shixiong Zhu 
Committed: Mon Feb 13 12:03:36 2017 -0800

--
 .../apache/spark/network/TransportContext.java  | 22 ++--
 .../spark/network/protocol/MessageDecoder.java  |  4 
 .../spark/network/protocol/MessageEncoder.java  |  4 
 .../network/server/TransportChannelHandler.java | 11 +-
 .../org/apache/spark/network/ProtocolSuite.java |  8 +++
 .../scala/org/apache/spark/util/Utils.scala | 16 --
 6 files changed, 38 insertions(+), 27 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/905fdf0c/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
--
diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
 
b/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
index 5b69e2b..37ba543 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
@@ -62,8 +62,20 @@ public class TransportContext {
   private final RpcHandler rpcHandler;
   private final boolean closeIdleConnections;
 
-  private final MessageEncoder encoder;
-  private final MessageDecoder decoder;
+  /**
+   * Force to create MessageEncoder and MessageDecoder so that we can make 
sure they will be created
+   * before switching the current context class loader to ExecutorClassLoader.
+   *
+   * Netty's MessageToMessageEncoder uses Javassist to generate a matcher 
class and the
+   * implementation calls "Class.forName" to check if this calls is already 
generated. If the
+   * following two objects are created in "ExecutorClassLoader.findClass", it 
will cause
+   * "ClassCircularityError". This is because loading this Netty generated 
class will call
+   * "ExecutorClass

spark git commit: [SPARK-17714][CORE][TEST-MAVEN][TEST-HADOOP2.6] Avoid using ExecutorClassLoader to load Netty generated classes

2017-02-13 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 c5a7cb022 -> 328b22984


[SPARK-17714][CORE][TEST-MAVEN][TEST-HADOOP2.6] Avoid using ExecutorClassLoader 
to load Netty generated classes

## What changes were proposed in this pull request?

Netty's `MessageToMessageEncoder` uses 
[Javassist](https://github.com/netty/netty/blob/91a0bdc17a8298437d6de08a8958d753799bd4a6/common/src/main/java/io/netty/util/internal/JavassistTypeParameterMatcherGenerator.java#L62)
 to generate a matcher class, and the implementation calls `Class.forName` to 
check whether this class has already been generated. If `MessageEncoder` or 
`MessageDecoder` is created in `ExecutorClassLoader.findClass`, it will cause 
`ClassCircularityError`. This is because loading this Netty-generated class 
calls `ExecutorClassLoader.findClass` to search for it, and 
`ExecutorClassLoader` tries to use RPC to load it, which triggers loading the 
not-yet-existing matcher class again. The JVM reports `ClassCircularityError` to 
prevent such infinite recursion.

# Why it only happens in Maven builds

It's because Maven and SBT have different class loader trees. The Maven build 
sets a URLClassLoader as the current context class loader to run the tests, 
which exposes this issue. The class loader tree is as follows:

```
bootstrap class loader ---- ... ---- REPL class loader ---- ExecutorClassLoader
  |
  |
URLClassLoader
```

The SBT build uses the bootstrap class loader directly, and 
`ReplSuite.test("propagation of local properties")` is the first test in 
ReplSuite, which happens to load 
`io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher`
 into the bootstrap class loader. (Note: in the Maven build, it's loaded into the 
URLClassLoader, so it cannot be found in ExecutorClassLoader.) This issue can 
be reproduced in SBT as well. Here are the steps to reproduce it:
- Enable `hadoop.caller.context.enabled`.
- Replace `Class.forName` with `Utils.classForName` in `object CallerContext`.
- Ignore `ReplSuite.test("propagation of local properties")`.
- Run `ReplSuite` using SBT.

This PR just creates a singleton MessageEncoder and MessageDecoder and makes 
sure they are created before switching to ExecutorClassLoader. TransportContext 
will be created when creating RpcEnv and that happens before creating 
ExecutorClassLoader.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #16859 from zsxwing/SPARK-17714.

(cherry picked from commit 905fdf0c243e1776c54c01a25b17878361400225)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/328b2298
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/328b2298
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/328b2298

Branch: refs/heads/branch-2.1
Commit: 328b229840d6e87c7faf7ee3cd5bf66a905c9a7d
Parents: c5a7cb0
Author: Shixiong Zhu 
Authored: Mon Feb 13 12:03:36 2017 -0800
Committer: Shixiong Zhu 
Committed: Mon Feb 13 12:03:44 2017 -0800

--
 .../apache/spark/network/TransportContext.java  | 22 ++--
 .../spark/network/protocol/MessageDecoder.java  |  4 
 .../spark/network/protocol/MessageEncoder.java  |  4 
 .../network/server/TransportChannelHandler.java | 11 +-
 .../org/apache/spark/network/ProtocolSuite.java |  8 +++
 .../scala/org/apache/spark/util/Utils.scala | 16 --
 6 files changed, 38 insertions(+), 27 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/328b2298/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
--
diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
 
b/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
index 5b69e2b..37ba543 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
@@ -62,8 +62,20 @@ public class TransportContext {
   private final RpcHandler rpcHandler;
   private final boolean closeIdleConnections;
 
-  private final MessageEncoder encoder;
-  private final MessageDecoder decoder;
+  /**
+   * Force to create MessageEncoder and MessageDecoder so that we can make 
sure they will be created
+   * before switching the current context class loader to ExecutorClassLoader.
+   *
+   * Netty's MessageToMessageEncoder uses Javassist to generate a matcher 
class and the
+   * implementation calls "Class.forName" to check if this calls is already 
generated. If the
+   * following two objects are created in "ExecutorClassLoader.findClass", it 
will cause
+   * "ClassCi

spark git commit: [HOTFIX][SPARK-19542][SS]Fix the missing import in DataStreamReaderWriterSuite

2017-02-13 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 328b22984 -> 2968d8c06


[HOTFIX][SPARK-19542][SS]Fix the missing import in DataStreamReaderWriterSuite


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2968d8c0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2968d8c0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2968d8c0

Branch: refs/heads/branch-2.1
Commit: 2968d8c0666801fb6a363dfca3c5a85ee8a1cc0c
Parents: 328b229
Author: Shixiong Zhu 
Authored: Mon Feb 13 12:35:56 2017 -0800
Committer: Shixiong Zhu 
Committed: Mon Feb 13 12:36:00 2017 -0800

--
 .../spark/sql/streaming/test/DataStreamReaderWriterSuite.scala  | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2968d8c0/sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataStreamReaderWriterSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataStreamReaderWriterSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataStreamReaderWriterSuite.scala
index f751948..4e63b04 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataStreamReaderWriterSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataStreamReaderWriterSuite.scala
@@ -22,6 +22,7 @@ import java.util.concurrent.TimeUnit
 
 import scala.concurrent.duration._
 
+import org.apache.hadoop.fs.Path
 import org.mockito.Mockito._
 import org.scalatest.{BeforeAndAfter, PrivateMethodTester}
 import org.scalatest.PrivateMethodTester.PrivateMethod





spark git commit: [SPARK-19599][SS] Clean up HDFSMetadataLog

2017-02-15 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master f6c3bba22 -> 21b4ba2d6


[SPARK-19599][SS] Clean up HDFSMetadataLog

## What changes were proposed in this pull request?

SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
cleanup for HDFSMetadataLog.

This PR includes the following changes:
- ~~Remove the workaround code for HADOOP-10622.~~ Unfortunately, there is 
another issue, 
[HADOOP-14084](https://issues.apache.org/jira/browse/HADOOP-14084), that 
prevents us from removing the workaround code.
- Remove the unnecessary `writer: (T, OutputStream) => Unit` parameter and just 
call `serialize` directly (see the sketch after this list).
- Remove catching FileNotFoundException.
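
A minimal, hedged sketch of the second bullet's refactor (the types and names here are simplified placeholders, not the real `HDFSMetadataLog` API): instead of threading a `writer` function through `add`, the log calls its own `serialize` directly.

```
import java.io.{ByteArrayOutputStream, OutputStream}

// Before: every caller had to pass the writer function down.
class LogBefore[T] {
  def add(batchId: Long, metadata: T, writer: (T, OutputStream) => Unit): Unit = {
    val out = new ByteArrayOutputStream()
    writer(metadata, out)                 // redundant indirection
  }
}

// After: the log simply uses its own serializer.
class LogAfter[T](serialize: (T, OutputStream) => Unit) {
  def add(batchId: Long, metadata: T): Unit = {
    val out = new ByteArrayOutputStream()
    serialize(metadata, out)
  }
}
```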

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #16932 from zsxwing/metadata-cleanup.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/21b4ba2d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/21b4ba2d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/21b4ba2d

Branch: refs/heads/master
Commit: 21b4ba2d6f21a9759af879471715c123073bd67a
Parents: f6c3bba
Author: Shixiong Zhu 
Authored: Wed Feb 15 16:21:43 2017 -0800
Committer: Shixiong Zhu 
Committed: Wed Feb 15 16:21:43 2017 -0800

--
 .../execution/streaming/HDFSMetadataLog.scala   | 39 +---
 .../execution/streaming/StreamExecution.scala   |  4 +-
 2 files changed, 19 insertions(+), 24 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/21b4ba2d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
index bfdc2cb..3155ce0 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
@@ -114,15 +114,18 @@ class HDFSMetadataLog[T <: AnyRef : 
ClassTag](sparkSession: SparkSession, path:
   case ut: UninterruptibleThread =>
 // When using a local file system, "writeBatch" must be called on a
 // [[org.apache.spark.util.UninterruptibleThread]] so that 
interrupts can be disabled
-// while writing the batch file. This is because there is a 
potential dead-lock in
-// Hadoop "Shell.runCommand" before 2.5.0 (HADOOP-10622). If the 
thread running
-// "Shell.runCommand" is interrupted, then the thread can get 
deadlocked. In our case,
-// `writeBatch` creates a file using HDFS API and will call 
"Shell.runCommand" to set
-// the file permission if using the local file system, and can get 
deadlocked if the
-// stream execution thread is stopped by interrupt. Hence, we make 
sure that
-// "writeBatch" is called on [[UninterruptibleThread]] which 
allows us to disable
-// interrupts here. Also see SPARK-14131.
-ut.runUninterruptibly { writeBatch(batchId, metadata, serialize) }
+// while writing the batch file.
+//
+// This is because Hadoop "Shell.runCommand" swallows 
InterruptException (HADOOP-14084).
+// If the user tries to stop a query, and the thread running 
"Shell.runCommand" is
+// interrupted, then InterruptException will be dropped and the 
query will be still
+// running. (Note: `writeBatch` creates a file using HDFS APIs and 
will call
+// "Shell.runCommand" to set the file permission if using the 
local file system)
+//
+// Hence, we make sure that "writeBatch" is called on 
[[UninterruptibleThread]] which
+// allows us to disable interrupts here, in order to propagate the 
interrupt state
+// correctly. Also see SPARK-19599.
+ut.runUninterruptibly { writeBatch(batchId, metadata) }
   case _ =>
 throw new IllegalStateException(
   "HDFSMetadataLog.add() on a local file system must be executed 
on " +
@@ -132,20 +135,19 @@ class HDFSMetadataLog[T <: AnyRef : 
ClassTag](sparkSession: SparkSession, path:
 // For a distributed file system, such as HDFS or S3, if the network 
is broken, write
 // operations may just hang until timeout. We should enable interrupts 
to allow stopping
 // the query fast.
-writeBatch(batchId, metadata, serialize)
+writeBatch(batchId, metadata)
   }
   true
 }
   }
 
-  def writeTempBatch(metadata: T, w
