spark git commit: [SPARK-18438][SPARKR][ML] spark.mlp should support RFormula.

2016-11-16 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 4ac9759f8 -> 95eb06bd7


[SPARK-18438][SPARKR][ML] spark.mlp should support RFormula.

## What changes were proposed in this pull request?
```spark.mlp``` should support ```RFormula``` like the other ML algorithm wrappers.
This patch also includes some cleanup and improvements for ```spark.mlp```.
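For context, a rough Scala sketch of what RFormula support means on the JVM side (an illustration only, not the wrapper's actual code; the layer sizes and column names are assumed example values): the formula assembles the label and features columns before the multilayer perceptron is fit.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.feature.RFormula

// The formula drives label/feature assembly; '~', '.', ':', '+', and '-' are the supported operators.
val rFormula = new RFormula().setFormula("label ~ features")

// Layer sizes are illustrative: 4 inputs, one hidden layer of 5 nodes, 3 output classes.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 3))
  .setBlockSize(128)
  .setMaxIter(100)
  .setSeed(1L)

val pipeline = new Pipeline().setStages(Array[PipelineStage](rFormula, mlp))
// val model = pipeline.fit(df)  // df: a DataFrame containing the formula's columns
```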

## How was this patch tested?
Unit tests.

Author: Yanbo Liang 

Closes #15883 from yanboliang/spark-18438.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/95eb06bd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/95eb06bd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/95eb06bd

Branch: refs/heads/master
Commit: 95eb06bd7d0f7110ef62c8d1cb6337c72b10d99f
Parents: 4ac9759
Author: Yanbo Liang 
Authored: Wed Nov 16 01:04:18 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Nov 16 01:04:18 2016 -0800

--
 R/pkg/R/generics.R  |  2 +-
 R/pkg/R/mllib.R | 30 ++
 R/pkg/inst/tests/testthat/test_mllib.R  | 63 +---
 .../MultilayerPerceptronClassifierWrapper.scala | 61 ++-
 4 files changed, 96 insertions(+), 60 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/95eb06bd/R/pkg/R/generics.R
--
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R
index 7653ca7..499c7b2 100644
--- a/R/pkg/R/generics.R
+++ b/R/pkg/R/generics.R
@@ -1373,7 +1373,7 @@ setGeneric("spark.logit", function(data, formula, ...) { 
standardGeneric("spark.
 
 #' @rdname spark.mlp
 #' @export
-setGeneric("spark.mlp", function(data, ...) { standardGeneric("spark.mlp") })
+setGeneric("spark.mlp", function(data, formula, ...) { 
standardGeneric("spark.mlp") })
 
 #' @rdname spark.naiveBayes
 #' @export

http://git-wip-us.apache.org/repos/asf/spark/blob/95eb06bd/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 1065b4b..265e64e 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -525,7 +525,7 @@ setMethod("write.ml", signature(object = "LDAModel", path = 
"character"),
 #' @note spark.isoreg since 2.1.0
 setMethod("spark.isoreg", signature(data = "SparkDataFrame", formula = 
"formula"),
   function(data, formula, isotonic = TRUE, featureIndex = 0, weightCol 
= NULL) {
-formula <- paste0(deparse(formula), collapse = "")
+formula <- paste(deparse(formula), collapse = "")
 
 if (is.null(weightCol)) {
   weightCol <- ""
@@ -775,7 +775,7 @@ setMethod("spark.logit", signature(data = "SparkDataFrame", 
formula = "formula")
tol = 1E-6, fitIntercept = TRUE, family = "auto", 
standardization = TRUE,
thresholds = 0.5, weightCol = NULL, aggregationDepth = 2,
probabilityCol = "probability") {
-formula <- paste0(deparse(formula), collapse = "")
+formula <- paste(deparse(formula), collapse = "")
 
 if (is.null(weightCol)) {
   weightCol <- ""
@@ -858,6 +858,8 @@ setMethod("summary", signature(object = 
"LogisticRegressionModel"),
 #'   Multilayer Perceptron}
 #'
 #' @param data a \code{SparkDataFrame} of observations and labels for model fitting.
+#' @param formula a symbolic description of the model to be fitted. Currently only a few formula
+#'                operators are supported, including '~', '.', ':', '+', and '-'.
 #' @param blockSize blockSize parameter.
 #' @param layers integer vector containing the number of nodes for each layer
 #' @param solver solver parameter, supported options: "gd" (minibatch gradient 
descent) or "l-bfgs".
@@ -870,7 +872,7 @@ setMethod("summary", signature(object = 
"LogisticRegressionModel"),
 #' @param ... additional arguments passed to the method.
 #' @return \code{spark.mlp} returns a fitted Multilayer Perceptron 
Classification Model.
 #' @rdname spark.mlp
-#' @aliases spark.mlp,SparkDataFrame-method
+#' @aliases spark.mlp,SparkDataFrame,formula-method
 #' @name spark.mlp
 #' @seealso \link{read.ml}
 #' @export
@@ -879,7 +881,7 @@ setMethod("summary", signature(object = 
"LogisticRegressionModel"),
 #' df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")
 #'
 #' # fit a Multilayer Perceptron Classification Model
-#' model <- spark.mlp(df, blockSize = 128, layers = c(4, 3), solver = "l-bfgs",
+#' model <- spark.mlp(df, label ~ features, blockSize = 128, layers = c(4, 3), solver = "l-bfgs",
 #'                    maxIter = 100, tol = 0.5, stepSize = 1, seed = 1,
 #'                    initialWeights = c(0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 9, 9, 9, 9, 9))
 #'
@@ -896,9 +898,10 @@ setMethod("summary", signat

spark git commit: [SPARK-18438][SPARKR][ML] spark.mlp should support RFormula.

2016-11-16 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 436ae201f -> 7b57e480d


[SPARK-18438][SPARKR][ML] spark.mlp should support RFormula.

## What changes were proposed in this pull request?
```spark.mlp``` should support ```RFormula``` like the other ML algorithm wrappers.
This patch also includes some cleanup and improvements for ```spark.mlp```.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang 

Closes #15883 from yanboliang/spark-18438.

(cherry picked from commit 95eb06bd7d0f7110ef62c8d1cb6337c72b10d99f)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7b57e480
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7b57e480
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7b57e480

Branch: refs/heads/branch-2.1
Commit: 7b57e480d2f2c0695eb4036199cd0db52c6f2008
Parents: 436ae20
Author: Yanbo Liang 
Authored: Wed Nov 16 01:04:18 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Nov 16 01:05:23 2016 -0800

--
 R/pkg/R/generics.R  |  2 +-
 R/pkg/R/mllib.R | 30 ++
 R/pkg/inst/tests/testthat/test_mllib.R  | 63 +---
 .../MultilayerPerceptronClassifierWrapper.scala | 61 ++-
 4 files changed, 96 insertions(+), 60 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7b57e480/R/pkg/R/generics.R
--
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R
index 7653ca7..499c7b2 100644
--- a/R/pkg/R/generics.R
+++ b/R/pkg/R/generics.R
@@ -1373,7 +1373,7 @@ setGeneric("spark.logit", function(data, formula, ...) { 
standardGeneric("spark.
 
 #' @rdname spark.mlp
 #' @export
-setGeneric("spark.mlp", function(data, ...) { standardGeneric("spark.mlp") })
+setGeneric("spark.mlp", function(data, formula, ...) { 
standardGeneric("spark.mlp") })
 
 #' @rdname spark.naiveBayes
 #' @export

http://git-wip-us.apache.org/repos/asf/spark/blob/7b57e480/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 1065b4b..265e64e 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -525,7 +525,7 @@ setMethod("write.ml", signature(object = "LDAModel", path = 
"character"),
 #' @note spark.isoreg since 2.1.0
 setMethod("spark.isoreg", signature(data = "SparkDataFrame", formula = 
"formula"),
   function(data, formula, isotonic = TRUE, featureIndex = 0, weightCol 
= NULL) {
-formula <- paste0(deparse(formula), collapse = "")
+formula <- paste(deparse(formula), collapse = "")
 
 if (is.null(weightCol)) {
   weightCol <- ""
@@ -775,7 +775,7 @@ setMethod("spark.logit", signature(data = "SparkDataFrame", 
formula = "formula")
tol = 1E-6, fitIntercept = TRUE, family = "auto", 
standardization = TRUE,
thresholds = 0.5, weightCol = NULL, aggregationDepth = 2,
probabilityCol = "probability") {
-formula <- paste0(deparse(formula), collapse = "")
+formula <- paste(deparse(formula), collapse = "")
 
 if (is.null(weightCol)) {
   weightCol <- ""
@@ -858,6 +858,8 @@ setMethod("summary", signature(object = 
"LogisticRegressionModel"),
 #'   Multilayer Perceptron}
 #'
 #' @param data a \code{SparkDataFrame} of observations and labels for model fitting.
+#' @param formula a symbolic description of the model to be fitted. Currently only a few formula
+#'                operators are supported, including '~', '.', ':', '+', and '-'.
 #' @param blockSize blockSize parameter.
 #' @param layers integer vector containing the number of nodes for each layer
 #' @param solver solver parameter, supported options: "gd" (minibatch gradient 
descent) or "l-bfgs".
@@ -870,7 +872,7 @@ setMethod("summary", signature(object = 
"LogisticRegressionModel"),
 #' @param ... additional arguments passed to the method.
 #' @return \code{spark.mlp} returns a fitted Multilayer Perceptron 
Classification Model.
 #' @rdname spark.mlp
-#' @aliases spark.mlp,SparkDataFrame-method
+#' @aliases spark.mlp,SparkDataFrame,formula-method
 #' @name spark.mlp
 #' @seealso \link{read.ml}
 #' @export
@@ -879,7 +881,7 @@ setMethod("summary", signature(object = 
"LogisticRegressionModel"),
 #' df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")
 #'
 #' # fit a Multilayer Perceptron Classification Model
-#' model <- spark.mlp(df, blockSize = 128, layers = c(4, 3), solver = "l-bfgs",
+#' model <- spark.mlp(df, label ~ features, blockSize = 128, layers = c(4, 3), solver = "l-bfgs",
 #'                    maxIter = 100, tol = 0.5, stepSize = 1, seed = 1,
 #'initialWeigh

spark git commit: [SPARK-18433][SQL] Improve DataSource option keys to be more case-insensitive

2016-11-16 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 95eb06bd7 -> 74f5c2176


[SPARK-18433][SQL] Improve DataSource option keys to be more case-insensitive

## What changes were proposed in this pull request?

This PR aims to improve DataSource option keys to be more case-insensitive

DataSource only partially uses CaseInsensitiveMap in its code paths. For example, the following fails to find the `url` option.

```scala
val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
df.write.format("jdbc")
.option("UrL", url1)
.option("dbtable", "TEST.SAVETEST")
.options(properties.asScala)
.save()
```

This PR makes DataSource options use CaseInsensitiveMap internally, and also makes DataSource use CaseInsensitiveMap generally, except for `InMemoryFileIndex` and `InsertIntoHadoopFsRelationCommand`. We cannot pass CaseInsensitiveMap to those two because they create new case-sensitive Hadoop configurations by calling newHadoopConfWithOptions(options) internally.
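For illustration, a minimal Scala sketch of such a case-insensitive map (the name `CaseInsensitiveMapSketch` and the details are assumptions for this example; the real class added by the patch is `CaseInsensitiveMap.scala` in the diff below):

```scala
// A case-insensitive view over a Map[String, String]: look-ups lower-case the key,
// so an option written as "UrL" and a later lookup of "url" resolve to the same entry.
class CaseInsensitiveMapSketch(originalMap: Map[String, String])
  extends Map[String, String] with Serializable {

  // Keys are stored lower-cased; keys differing only by case collapse to one entry.
  private val lowerCased: Map[String, String] =
    originalMap.map { case (k, v) => (k.toLowerCase, v) }

  override def get(key: String): Option[String] = lowerCased.get(key.toLowerCase)

  override def iterator: Iterator[(String, String)] = lowerCased.iterator

  override def +[B1 >: String](kv: (String, B1)): Map[String, B1] =
    lowerCased + (kv._1.toLowerCase -> kv._2)

  override def -(key: String): Map[String, String] = lowerCased - key.toLowerCase
}

// new CaseInsensitiveMapSketch(Map("UrL" -> "jdbc:h2:mem:test")).get("url")
// => Some("jdbc:h2:mem:test")
```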

## How was this patch tested?

Pass the Jenkins test with newly added test cases.

Author: Dongjoon Hyun 

Closes #15884 from dongjoon-hyun/SPARK-18433.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/74f5c217
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/74f5c217
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/74f5c217

Branch: refs/heads/master
Commit: 74f5c2176d8449e41f520febd38109edaf3f4172
Parents: 95eb06b
Author: Dongjoon Hyun 
Authored: Wed Nov 16 17:12:18 2016 +0800
Committer: Wenchen Fan 
Committed: Wed Nov 16 17:12:18 2016 +0800

--
 .../spark/sql/catalyst/json/JSONOptions.scala   |  6 ++--
 .../sql/catalyst/util/CaseInsensitiveMap.scala  | 36 
 .../spark/sql/execution/command/ddl.scala   |  2 +-
 .../sql/execution/datasources/DataSource.scala  | 30 
 .../execution/datasources/csv/CSVOptions.scala  |  8 +++--
 .../spark/sql/execution/datasources/ddl.scala   | 18 --
 .../datasources/jdbc/JDBCOptions.scala  | 10 --
 .../datasources/parquet/ParquetOptions.scala|  6 +++-
 .../execution/streaming/FileStreamOptions.scala |  8 +++--
 .../datasources/csv/CSVInferSchemaSuite.scala   |  5 +++
 .../execution/datasources/json/JsonSuite.scala  | 19 +--
 .../datasources/parquet/ParquetIOSuite.scala|  7 
 .../apache/spark/sql/jdbc/JDBCWriteSuite.scala  |  9 +
 .../sql/streaming/FileStreamSourceSuite.scala   |  5 +++
 .../spark/sql/hive/HiveExternalCatalog.scala|  2 +-
 .../apache/spark/sql/hive/orc/OrcOptions.scala  |  6 +++-
 .../spark/sql/hive/orc/OrcSourceSuite.scala |  4 +++
 .../apache/spark/sql/hive/parquetSuites.scala   |  1 +
 18 files changed, 133 insertions(+), 49 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/74f5c217/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
index c459706..38e191b 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
@@ -23,7 +23,7 @@ import com.fasterxml.jackson.core.{JsonFactory, JsonParser}
 import org.apache.commons.lang3.time.FastDateFormat
 
 import org.apache.spark.internal.Logging
-import org.apache.spark.sql.catalyst.util.{CompressionCodecs, ParseModes}
+import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, 
CompressionCodecs, ParseModes}
 
 /**
  * Options for parsing JSON data into Spark SQL rows.
@@ -31,9 +31,11 @@ import 
org.apache.spark.sql.catalyst.util.{CompressionCodecs, ParseModes}
  * Most of these map directly to Jackson's internal options, specified in 
[[JsonParser.Feature]].
  */
 private[sql] class JSONOptions(
-@transient private val parameters: Map[String, String])
+@transient private val parameters: CaseInsensitiveMap)
   extends Logging with Serializable  {
 
+  def this(parameters: Map[String, String]) = this(new CaseInsensitiveMap(parameters))
+
   val samplingRatio =
 parameters.get("samplingRatio").map(_.toDouble).getOrElse(1.0)
   val primitivesAsString =

http://git-wip-us.apache.org/repos/asf/spark/blob/74f5c217/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CaseInsensitiveMap.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CaseInsensitiveMap.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CaseInsensitiveMap.scala
new file mode 100644
index 000..a7f7a8a
--- /dev/null
+++ 
b/sql/catalys

spark git commit: [SPARK-18433][SQL] Improve DataSource option keys to be more case-insensitive

2016-11-16 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 7b57e480d -> b18c5a9b9


[SPARK-18433][SQL] Improve DataSource option keys to be more case-insensitive

## What changes were proposed in this pull request?

This PR aims to improve DataSource option keys to be more case-insensitive

DataSource only partially uses CaseInsensitiveMap in its code paths. For example, the following fails to find the `url` option.

```scala
val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
df.write.format("jdbc")
.option("UrL", url1)
.option("dbtable", "TEST.SAVETEST")
.options(properties.asScala)
.save()
```

This PR makes DataSource options use CaseInsensitiveMap internally, and also makes DataSource use CaseInsensitiveMap generally, except for `InMemoryFileIndex` and `InsertIntoHadoopFsRelationCommand`. We cannot pass CaseInsensitiveMap to those two because they create new case-sensitive Hadoop configurations by calling newHadoopConfWithOptions(options) internally.

## How was this patch tested?

Pass the Jenkins test with newly added test cases.

Author: Dongjoon Hyun 

Closes #15884 from dongjoon-hyun/SPARK-18433.

(cherry picked from commit 74f5c2176d8449e41f520febd38109edaf3f4172)
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b18c5a9b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b18c5a9b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b18c5a9b

Branch: refs/heads/branch-2.1
Commit: b18c5a9b97981742b6ee1c928705d9af0dc85e70
Parents: 7b57e48
Author: Dongjoon Hyun 
Authored: Wed Nov 16 17:12:18 2016 +0800
Committer: Wenchen Fan 
Committed: Wed Nov 16 17:12:37 2016 +0800

--
 .../spark/sql/catalyst/json/JSONOptions.scala   |  6 ++--
 .../sql/catalyst/util/CaseInsensitiveMap.scala  | 36 
 .../spark/sql/execution/command/ddl.scala   |  2 +-
 .../sql/execution/datasources/DataSource.scala  | 30 
 .../execution/datasources/csv/CSVOptions.scala  |  8 +++--
 .../spark/sql/execution/datasources/ddl.scala   | 18 --
 .../datasources/jdbc/JDBCOptions.scala  | 10 --
 .../datasources/parquet/ParquetOptions.scala|  6 +++-
 .../execution/streaming/FileStreamOptions.scala |  8 +++--
 .../datasources/csv/CSVInferSchemaSuite.scala   |  5 +++
 .../execution/datasources/json/JsonSuite.scala  | 19 +--
 .../datasources/parquet/ParquetIOSuite.scala|  7 
 .../apache/spark/sql/jdbc/JDBCWriteSuite.scala  |  9 +
 .../sql/streaming/FileStreamSourceSuite.scala   |  5 +++
 .../spark/sql/hive/HiveExternalCatalog.scala|  2 +-
 .../apache/spark/sql/hive/orc/OrcOptions.scala  |  6 +++-
 .../spark/sql/hive/orc/OrcSourceSuite.scala |  4 +++
 .../apache/spark/sql/hive/parquetSuites.scala   |  1 +
 18 files changed, 133 insertions(+), 49 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b18c5a9b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
index c459706..38e191b 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
@@ -23,7 +23,7 @@ import com.fasterxml.jackson.core.{JsonFactory, JsonParser}
 import org.apache.commons.lang3.time.FastDateFormat
 
 import org.apache.spark.internal.Logging
-import org.apache.spark.sql.catalyst.util.{CompressionCodecs, ParseModes}
+import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, 
CompressionCodecs, ParseModes}
 
 /**
  * Options for parsing JSON data into Spark SQL rows.
@@ -31,9 +31,11 @@ import 
org.apache.spark.sql.catalyst.util.{CompressionCodecs, ParseModes}
  * Most of these map directly to Jackson's internal options, specified in 
[[JsonParser.Feature]].
  */
 private[sql] class JSONOptions(
-@transient private val parameters: Map[String, String])
+@transient private val parameters: CaseInsensitiveMap)
   extends Logging with Serializable  {
 
+  def this(parameters: Map[String, String]) = this(new CaseInsensitiveMap(parameters))
+
   val samplingRatio =
 parameters.get("samplingRatio").map(_.toDouble).getOrElse(1.0)
   val primitivesAsString =

http://git-wip-us.apache.org/repos/asf/spark/blob/b18c5a9b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CaseInsensitiveMap.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CaseInsensitiveMap.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/

spark git commit: [DOC][MINOR] Kafka doc: breakup into lines

2016-11-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 74f5c2176 -> 3e01f1282


[DOC][MINOR] Kafka doc: breakup into lines

## Before

![before](https://cloud.githubusercontent.com/assets/15843379/20340231/99b039fe-ac1b-11e6-9ba9-b44582427459.png)

## After

![after](https://cloud.githubusercontent.com/assets/15843379/20340236/9d5796e2-ac1b-11e6-92bb-6da40ba1a383.png)

Author: Liwei Lin 

Closes #15903 from lw-lin/kafka-doc-lines.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3e01f128
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3e01f128
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3e01f128

Branch: refs/heads/master
Commit: 3e01f128284993f39463c0ccd902b774f57cce76
Parents: 74f5c21
Author: Liwei Lin 
Authored: Wed Nov 16 09:51:59 2016 +
Committer: Sean Owen 
Committed: Wed Nov 16 09:51:59 2016 +

--
 docs/structured-streaming-kafka-integration.md | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3e01f128/docs/structured-streaming-kafka-integration.md
--
diff --git a/docs/structured-streaming-kafka-integration.md 
b/docs/structured-streaming-kafka-integration.md
index c4c9fb3..2458bb5 100644
--- a/docs/structured-streaming-kafka-integration.md
+++ b/docs/structured-streaming-kafka-integration.md
@@ -240,6 +240,7 @@ Kafka's own configurations can be set via 
`DataStreamReader.option` with `kafka.
 [Kafka consumer config 
docs](http://kafka.apache.org/documentation.html#newconsumerconfigs).
 
 Note that the following Kafka params cannot be set and the Kafka source will 
throw an exception:
+
 - **group.id**: Kafka source will create a unique group id for each query 
automatically.
 - **auto.offset.reset**: Set the source option `startingOffsets` to specify
  where to start instead. Structured Streaming manages which offsets are 
consumed internally, rather 





spark git commit: [DOC][MINOR] Kafka doc: breakup into lines

2016-11-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 b18c5a9b9 -> 4567db9da


[DOC][MINOR] Kafka doc: breakup into lines

## Before

![before](https://cloud.githubusercontent.com/assets/15843379/20340231/99b039fe-ac1b-11e6-9ba9-b44582427459.png)

## After

![after](https://cloud.githubusercontent.com/assets/15843379/20340236/9d5796e2-ac1b-11e6-92bb-6da40ba1a383.png)

Author: Liwei Lin 

Closes #15903 from lw-lin/kafka-doc-lines.

(cherry picked from commit 3e01f128284993f39463c0ccd902b774f57cce76)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4567db9d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4567db9d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4567db9d

Branch: refs/heads/branch-2.1
Commit: 4567db9da47f0830e952614393d6105f4f5587a0
Parents: b18c5a9
Author: Liwei Lin 
Authored: Wed Nov 16 09:51:59 2016 +
Committer: Sean Owen 
Committed: Wed Nov 16 09:52:09 2016 +

--
 docs/structured-streaming-kafka-integration.md | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4567db9d/docs/structured-streaming-kafka-integration.md
--
diff --git a/docs/structured-streaming-kafka-integration.md 
b/docs/structured-streaming-kafka-integration.md
index c4c9fb3..2458bb5 100644
--- a/docs/structured-streaming-kafka-integration.md
+++ b/docs/structured-streaming-kafka-integration.md
@@ -240,6 +240,7 @@ Kafka's own configurations can be set via 
`DataStreamReader.option` with `kafka.
 [Kafka consumer config 
docs](http://kafka.apache.org/documentation.html#newconsumerconfigs).
 
 Note that the following Kafka params cannot be set and the Kafka source will 
throw an exception:
+
 - **group.id**: Kafka source will create a unique group id for each query 
automatically.
 - **auto.offset.reset**: Set the source option `startingOffsets` to specify
  where to start instead. Structured Streaming manages which offsets are 
consumed internally, rather 





spark git commit: [SPARK-18400][STREAMING] NPE when resharding Kinesis Stream

2016-11-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 3e01f1282 -> 43a26899e


[SPARK-18400][STREAMING] NPE when resharding Kinesis Stream

## What changes were proposed in this pull request?

Avoid NPE in KinesisRecordProcessor when shutdown happens without successful 
init

## How was this patch tested?

Existing tests

Author: Sean Owen 

Closes #15882 from srowen/SPARK-18400.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/43a26899
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/43a26899
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/43a26899

Branch: refs/heads/master
Commit: 43a26899e5dd2364297eaf8985bd68367e4735a7
Parents: 3e01f12
Author: Sean Owen 
Authored: Wed Nov 16 10:16:36 2016 +
Committer: Sean Owen 
Committed: Wed Nov 16 10:16:36 2016 +

--
 .../kinesis/KinesisRecordProcessor.scala| 42 +++-
 1 file changed, 23 insertions(+), 19 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/43a26899/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
--
diff --git 
a/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
 
b/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
index 80e0cce..a0ccd08 100644
--- 
a/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
+++ 
b/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
@@ -27,7 +27,6 @@ import 
com.amazonaws.services.kinesis.clientlibrary.types.ShutdownReason
 import com.amazonaws.services.kinesis.model.Record
 
 import org.apache.spark.internal.Logging
-import org.apache.spark.streaming.Duration
 
 /**
  * Kinesis-specific implementation of the Kinesis Client Library (KCL) 
IRecordProcessor.
@@ -102,27 +101,32 @@ private[kinesis] class 
KinesisRecordProcessor[T](receiver: KinesisReceiver[T], w
* @param checkpointer used to perform a Kinesis checkpoint for 
ShutdownReason.TERMINATE
* @param reason for shutdown (ShutdownReason.TERMINATE or 
ShutdownReason.ZOMBIE)
*/
-  override def shutdown(checkpointer: IRecordProcessorCheckpointer, reason: 
ShutdownReason) {
+  override def shutdown(
+  checkpointer: IRecordProcessorCheckpointer,
+  reason: ShutdownReason): Unit = {
 logInfo(s"Shutdown:  Shutting down workerId $workerId with reason $reason")
-reason match {
-  /*
-   * TERMINATE Use Case.  Checkpoint.
-   * Checkpoint to indicate that all records from the shard have been 
drained and processed.
-   * It's now OK to read from the new shards that resulted from a 
resharding event.
-   */
-  case ShutdownReason.TERMINATE =>
-receiver.removeCheckpointer(shardId, checkpointer)
+// null if not initialized before shutdown:
+if (shardId == null) {
+  logWarning(s"No shardId for workerId $workerId?")
+} else {
+  reason match {
+/*
+ * TERMINATE Use Case.  Checkpoint.
+ * Checkpoint to indicate that all records from the shard have been 
drained and processed.
+ * It's now OK to read from the new shards that resulted from a 
resharding event.
+ */
+case ShutdownReason.TERMINATE => receiver.removeCheckpointer(shardId, 
checkpointer)
 
-  /*
-   * ZOMBIE Use Case or Unknown reason.  NoOp.
-   * No checkpoint because other workers may have taken over and already 
started processing
-   *the same records.
-   * This may lead to records being processed more than once.
-   */
-  case _ =>
-receiver.removeCheckpointer(shardId, null) // return null so that we 
don't checkpoint
+/*
+ * ZOMBIE Use Case or Unknown reason.  NoOp.
+ * No checkpoint because other workers may have taken over and already 
started processing
+ *the same records.
+ * This may lead to records being processed more than once.
+ * Return null so that we don't checkpoint
+ */
+case _ => receiver.removeCheckpointer(shardId, null)
+  }
 }
-
   }
 }
 





spark git commit: [SPARK-18400][STREAMING] NPE when resharding Kinesis Stream

2016-11-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 4567db9da -> a94659cee


[SPARK-18400][STREAMING] NPE when resharding Kinesis Stream

## What changes were proposed in this pull request?

Avoid NPE in KinesisRecordProcessor when shutdown happens without successful 
init

## How was this patch tested?

Existing tests

Author: Sean Owen 

Closes #15882 from srowen/SPARK-18400.

(cherry picked from commit 43a26899e5dd2364297eaf8985bd68367e4735a7)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a94659ce
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a94659ce
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a94659ce

Branch: refs/heads/branch-2.1
Commit: a94659ceeb339a93f72bad3ed059bd2cdfca4df9
Parents: 4567db9
Author: Sean Owen 
Authored: Wed Nov 16 10:16:36 2016 +
Committer: Sean Owen 
Committed: Wed Nov 16 10:16:45 2016 +

--
 .../kinesis/KinesisRecordProcessor.scala| 42 +++-
 1 file changed, 23 insertions(+), 19 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a94659ce/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
--
diff --git 
a/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
 
b/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
index 80e0cce..a0ccd08 100644
--- 
a/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
+++ 
b/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
@@ -27,7 +27,6 @@ import 
com.amazonaws.services.kinesis.clientlibrary.types.ShutdownReason
 import com.amazonaws.services.kinesis.model.Record
 
 import org.apache.spark.internal.Logging
-import org.apache.spark.streaming.Duration
 
 /**
  * Kinesis-specific implementation of the Kinesis Client Library (KCL) 
IRecordProcessor.
@@ -102,27 +101,32 @@ private[kinesis] class 
KinesisRecordProcessor[T](receiver: KinesisReceiver[T], w
* @param checkpointer used to perform a Kinesis checkpoint for 
ShutdownReason.TERMINATE
* @param reason for shutdown (ShutdownReason.TERMINATE or 
ShutdownReason.ZOMBIE)
*/
-  override def shutdown(checkpointer: IRecordProcessorCheckpointer, reason: 
ShutdownReason) {
+  override def shutdown(
+  checkpointer: IRecordProcessorCheckpointer,
+  reason: ShutdownReason): Unit = {
 logInfo(s"Shutdown:  Shutting down workerId $workerId with reason $reason")
-reason match {
-  /*
-   * TERMINATE Use Case.  Checkpoint.
-   * Checkpoint to indicate that all records from the shard have been 
drained and processed.
-   * It's now OK to read from the new shards that resulted from a 
resharding event.
-   */
-  case ShutdownReason.TERMINATE =>
-receiver.removeCheckpointer(shardId, checkpointer)
+// null if not initialized before shutdown:
+if (shardId == null) {
+  logWarning(s"No shardId for workerId $workerId?")
+} else {
+  reason match {
+/*
+ * TERMINATE Use Case.  Checkpoint.
+ * Checkpoint to indicate that all records from the shard have been 
drained and processed.
+ * It's now OK to read from the new shards that resulted from a 
resharding event.
+ */
+case ShutdownReason.TERMINATE => receiver.removeCheckpointer(shardId, 
checkpointer)
 
-  /*
-   * ZOMBIE Use Case or Unknown reason.  NoOp.
-   * No checkpoint because other workers may have taken over and already 
started processing
-   *the same records.
-   * This may lead to records being processed more than once.
-   */
-  case _ =>
-receiver.removeCheckpointer(shardId, null) // return null so that we 
don't checkpoint
+/*
+ * ZOMBIE Use Case or Unknown reason.  NoOp.
+ * No checkpoint because other workers may have taken over and already 
started processing
+ *the same records.
+ * This may lead to records being processed more than once.
+ * Return null so that we don't checkpoint
+ */
+case _ => receiver.removeCheckpointer(shardId, null)
+  }
 }
-
   }
 }
 





spark git commit: [SPARK-18400][STREAMING] NPE when resharding Kinesis Stream

2016-11-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 8d55886aa -> 4f3f09696


[SPARK-18400][STREAMING] NPE when resharding Kinesis Stream

## What changes were proposed in this pull request?

Avoid NPE in KinesisRecordProcessor when shutdown happens without successful 
init

## How was this patch tested?

Existing tests

Author: Sean Owen 

Closes #15882 from srowen/SPARK-18400.

(cherry picked from commit 43a26899e5dd2364297eaf8985bd68367e4735a7)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4f3f0969
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4f3f0969
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4f3f0969

Branch: refs/heads/branch-2.0
Commit: 4f3f09696ea12b631e1db8d00baf363292c5f3e3
Parents: 8d55886
Author: Sean Owen 
Authored: Wed Nov 16 10:16:36 2016 +
Committer: Sean Owen 
Committed: Wed Nov 16 10:16:58 2016 +

--
 .../kinesis/KinesisRecordProcessor.scala| 42 +++-
 1 file changed, 23 insertions(+), 19 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4f3f0969/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
--
diff --git 
a/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
 
b/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
index 80e0cce..a0ccd08 100644
--- 
a/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
+++ 
b/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala
@@ -27,7 +27,6 @@ import 
com.amazonaws.services.kinesis.clientlibrary.types.ShutdownReason
 import com.amazonaws.services.kinesis.model.Record
 
 import org.apache.spark.internal.Logging
-import org.apache.spark.streaming.Duration
 
 /**
  * Kinesis-specific implementation of the Kinesis Client Library (KCL) 
IRecordProcessor.
@@ -102,27 +101,32 @@ private[kinesis] class 
KinesisRecordProcessor[T](receiver: KinesisReceiver[T], w
* @param checkpointer used to perform a Kinesis checkpoint for 
ShutdownReason.TERMINATE
* @param reason for shutdown (ShutdownReason.TERMINATE or 
ShutdownReason.ZOMBIE)
*/
-  override def shutdown(checkpointer: IRecordProcessorCheckpointer, reason: 
ShutdownReason) {
+  override def shutdown(
+  checkpointer: IRecordProcessorCheckpointer,
+  reason: ShutdownReason): Unit = {
 logInfo(s"Shutdown:  Shutting down workerId $workerId with reason $reason")
-reason match {
-  /*
-   * TERMINATE Use Case.  Checkpoint.
-   * Checkpoint to indicate that all records from the shard have been 
drained and processed.
-   * It's now OK to read from the new shards that resulted from a 
resharding event.
-   */
-  case ShutdownReason.TERMINATE =>
-receiver.removeCheckpointer(shardId, checkpointer)
+// null if not initialized before shutdown:
+if (shardId == null) {
+  logWarning(s"No shardId for workerId $workerId?")
+} else {
+  reason match {
+/*
+ * TERMINATE Use Case.  Checkpoint.
+ * Checkpoint to indicate that all records from the shard have been 
drained and processed.
+ * It's now OK to read from the new shards that resulted from a 
resharding event.
+ */
+case ShutdownReason.TERMINATE => receiver.removeCheckpointer(shardId, 
checkpointer)
 
-  /*
-   * ZOMBIE Use Case or Unknown reason.  NoOp.
-   * No checkpoint because other workers may have taken over and already 
started processing
-   *the same records.
-   * This may lead to records being processed more than once.
-   */
-  case _ =>
-receiver.removeCheckpointer(shardId, null) // return null so that we 
don't checkpoint
+/*
+ * ZOMBIE Use Case or Unknown reason.  NoOp.
+ * No checkpoint because other workers may have taken over and already 
started processing
+ *the same records.
+ * This may lead to records being processed more than once.
+ * Return null so that we don't checkpoint
+ */
+case _ => receiver.removeCheckpointer(shardId, null)
+  }
 }
-
   }
 }
 





spark git commit: [SPARK-18410][STREAMING] Add structured kafka example

2016-11-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 43a26899e -> e6145772e


[SPARK-18410][STREAMING] Add structured kafka example

## What changes were proposed in this pull request?

This PR provides Structured Streaming Kafka word count examples in Scala, Java, and Python.

## How was this patch tested?

Author: uncleGen 

Closes #15849 from uncleGen/SPARK-18410.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e6145772
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e6145772
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e6145772

Branch: refs/heads/master
Commit: e6145772eda8d6d3727605e80a7c2f182c801003
Parents: 43a2689
Author: uncleGen 
Authored: Wed Nov 16 10:19:10 2016 +
Committer: Sean Owen 
Committed: Wed Nov 16 10:19:10 2016 +

--
 .../streaming/JavaStructuredKafkaWordCount.java | 96 
 .../sql/streaming/structured_kafka_wordcount.py | 90 ++
 .../streaming/StructuredKafkaWordCount.scala| 85 +
 3 files changed, 271 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e6145772/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredKafkaWordCount.java
--
diff --git 
a/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredKafkaWordCount.java
 
b/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredKafkaWordCount.java
new file mode 100644
index 000..0f45cfe
--- /dev/null
+++ 
b/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredKafkaWordCount.java
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.sql.streaming;
+
+import org.apache.spark.api.java.function.FlatMapFunction;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.streaming.StreamingQuery;
+
+import java.util.Arrays;
+import java.util.Iterator;
+
+/**
+ * Consumes messages from one or more topics in Kafka and does wordcount.
+ * Usage: JavaStructuredKafkaWordCount   

+ *The Kafka "bootstrap.servers" configuration. A
+ *   comma-separated list of host:port.
+ *There are three kinds of type, i.e. 'assign', 
'subscribe',
+ *   'subscribePattern'.
+ *   |-  Specific TopicPartitions to consume. Json string
+ *   |  {"topicA":[0,1],"topicB":[2,4]}.
+ *   |-  The topic list to subscribe. A comma-separated list of
+ *   |  topics.
+ *   |-  The pattern used to subscribe to topic(s).
+ *   |  Java regex string.
+ *   |- Only one of "assign, "subscribe" or "subscribePattern" options can be
+ *   |  specified for Kafka source.
+ *Different value format depends on the value of 'subscribe-type'.
+ *
+ * Example:
+ *`$ bin/run-example \
+ *  sql.streaming.JavaStructuredKafkaWordCount host1:port1,host2:port2 \
+ *  subscribe topic1,topic2`
+ */
+public final class JavaStructuredKafkaWordCount {
+
+  public static void main(String[] args) throws Exception {
+if (args.length < 3) {
+  System.err.println("Usage: JavaStructuredKafkaWordCount 
 " +
+" ");
+  System.exit(1);
+}
+
+String bootstrapServers = args[0];
+String subscribeType = args[1];
+String topics = args[2];
+
+SparkSession spark = SparkSession
+  .builder()
+  .appName("JavaStructuredKafkaWordCount")
+  .getOrCreate();
+
+// Create DataSet representing the stream of input lines from kafka
+Dataset<String> lines = spark
+  .readStream()
+  .format("kafka")
+  .option("kafka.bootstrap.servers", bootstrapServers)
+  .option(subscribeType, topics)
+  .load()
+  .selectExpr("CAST(value AS STRING)")
+  .as(Encoders.STRING());
+
+// Generate running word count
+Dataset<String> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() {
+  @Override
+  public Iterator<String> call(String x) {
+return Arrays.asList(x.split(" ")).iterator();
+

spark git commit: [SPARK-18410][STREAMING] Add structured kafka example

2016-11-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 a94659cee -> 6b2301b89


[SPARK-18410][STREAMING] Add structured kafka example

## What changes were proposed in this pull request?

This PR provides Structured Streaming Kafka word count examples in Scala, Java, and Python.

## How was this patch tested?

Author: uncleGen 

Closes #15849 from uncleGen/SPARK-18410.

(cherry picked from commit e6145772eda8d6d3727605e80a7c2f182c801003)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6b2301b8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6b2301b8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6b2301b8

Branch: refs/heads/branch-2.1
Commit: 6b2301b89bf5a89bd2b8a3d85c9c05a490be2ddb
Parents: a94659c
Author: uncleGen 
Authored: Wed Nov 16 10:19:10 2016 +
Committer: Sean Owen 
Committed: Wed Nov 16 10:19:18 2016 +

--
 .../streaming/JavaStructuredKafkaWordCount.java | 96 
 .../sql/streaming/structured_kafka_wordcount.py | 90 ++
 .../streaming/StructuredKafkaWordCount.scala| 85 +
 3 files changed, 271 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6b2301b8/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredKafkaWordCount.java
--
diff --git 
a/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredKafkaWordCount.java
 
b/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredKafkaWordCount.java
new file mode 100644
index 000..0f45cfe
--- /dev/null
+++ 
b/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredKafkaWordCount.java
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.sql.streaming;
+
+import org.apache.spark.api.java.function.FlatMapFunction;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.streaming.StreamingQuery;
+
+import java.util.Arrays;
+import java.util.Iterator;
+
+/**
+ * Consumes messages from one or more topics in Kafka and does wordcount.
+ * Usage: JavaStructuredKafkaWordCount   

+ *The Kafka "bootstrap.servers" configuration. A
+ *   comma-separated list of host:port.
+ *There are three kinds of type, i.e. 'assign', 
'subscribe',
+ *   'subscribePattern'.
+ *   |-  Specific TopicPartitions to consume. Json string
+ *   |  {"topicA":[0,1],"topicB":[2,4]}.
+ *   |-  The topic list to subscribe. A comma-separated list of
+ *   |  topics.
+ *   |-  The pattern used to subscribe to topic(s).
+ *   |  Java regex string.
+ *   |- Only one of "assign, "subscribe" or "subscribePattern" options can be
+ *   |  specified for Kafka source.
+ *Different value format depends on the value of 'subscribe-type'.
+ *
+ * Example:
+ *`$ bin/run-example \
+ *  sql.streaming.JavaStructuredKafkaWordCount host1:port1,host2:port2 \
+ *  subscribe topic1,topic2`
+ */
+public final class JavaStructuredKafkaWordCount {
+
+  public static void main(String[] args) throws Exception {
+if (args.length < 3) {
+  System.err.println("Usage: JavaStructuredKafkaWordCount 
 " +
+" ");
+  System.exit(1);
+}
+
+String bootstrapServers = args[0];
+String subscribeType = args[1];
+String topics = args[2];
+
+SparkSession spark = SparkSession
+  .builder()
+  .appName("JavaStructuredKafkaWordCount")
+  .getOrCreate();
+
+// Create DataSet representing the stream of input lines from kafka
+Dataset<String> lines = spark
+  .readStream()
+  .format("kafka")
+  .option("kafka.bootstrap.servers", bootstrapServers)
+  .option(subscribeType, topics)
+  .load()
+  .selectExpr("CAST(value AS STRING)")
+  .as(Encoders.STRING());
+
+// Generate running word count
+Dataset<String> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() {
+  @Ov

spark git commit: [MINOR][DOC] Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' documentation

2016-11-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master e6145772e -> 241e04bc0


[MINOR][DOC] Fix typos in the 'configuration', 'monitoring' and 
'sql-programming-guide' documentation

## What changes were proposed in this pull request?

Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' 
documentation.

## How was this patch tested?
Manually.

Author: Weiqing Yang 

Closes #15886 from weiqingy/fixTypo.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/241e04bc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/241e04bc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/241e04bc

Branch: refs/heads/master
Commit: 241e04bc03efb1379622c0c84299e617512973ac
Parents: e614577
Author: Weiqing Yang 
Authored: Wed Nov 16 10:34:56 2016 +
Committer: Sean Owen 
Committed: Wed Nov 16 10:34:56 2016 +

--
 docs/configuration.md | 2 +-
 docs/monitoring.md| 2 +-
 docs/sql-programming-guide.md | 6 +++---
 3 files changed, 5 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/241e04bc/docs/configuration.md
--
diff --git a/docs/configuration.md b/docs/configuration.md
index ea99592..c021a37 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1951,7 +1951,7 @@ showDF(properties, numRows = 200, truncate = FALSE)
   spark.r.heartBeatInterval
   100
   
-Interval for heartbeats sents from SparkR backend to R process to prevent 
connection timeout.
+Interval for heartbeats sent from SparkR backend to R process to prevent 
connection timeout.
   
 
 

http://git-wip-us.apache.org/repos/asf/spark/blob/241e04bc/docs/monitoring.md
--
diff --git a/docs/monitoring.md b/docs/monitoring.md
index 5bc5e18..2eef456 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -41,7 +41,7 @@ directory must be supplied in the 
`spark.history.fs.logDirectory` configuration
 and should contain sub-directories that each represents an application's event 
logs.
 
 The spark jobs themselves must be configured to log events, and to log them to 
the same shared,
-writeable directory. For example, if the server was configured with a log 
directory of
+writable directory. For example, if the server was configured with a log 
directory of
 `hdfs://namenode/shared/spark-logs`, then the client-side options would be:
 
 ```

http://git-wip-us.apache.org/repos/asf/spark/blob/241e04bc/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index b9be7a7..ba3e55f 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -222,9 +222,9 @@ The `sql` function enables applications to run SQL queries 
programmatically and
 
 ## Global Temporary View
 
-Temporay views in Spark SQL are session-scoped and will disappear if the 
session that creates it
+Temporary views in Spark SQL are session-scoped and will disappear if the 
session that creates it
 terminates. If you want to have a temporary view that is shared among all 
sessions and keep alive
-until the Spark application terminiates, you can create a global temporary 
view. Global temporary
+until the Spark application terminates, you can create a global temporary 
view. Global temporary
 view is tied to a system preserved database `global_temp`, and we must use the 
qualified name to
 refer it, e.g. `SELECT * FROM global_temp.view1`.
 
@@ -1029,7 +1029,7 @@ following command:
 bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars 
postgresql-9.4.1207.jar
 {% endhighlight %}
 
-Tables from the remote database can be loaded as a DataFrame or Spark SQL 
Temporary table using
+Tables from the remote database can be loaded as a DataFrame or Spark SQL 
temporary view using
 the Data Sources API. Users can specify the JDBC connection properties in the 
data source options.
 user and password are normally provided as 
connection properties for
 logging into the data sources. In addition to the connection properties, Spark 
also supports





spark git commit: [MINOR][DOC] Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' documentation

2016-11-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 6b2301b89 -> 820847008


[MINOR][DOC] Fix typos in the 'configuration', 'monitoring' and 
'sql-programming-guide' documentation

## What changes were proposed in this pull request?

Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' 
documentation.

## How was this patch tested?
Manually.

Author: Weiqing Yang 

Closes #15886 from weiqingy/fixTypo.

(cherry picked from commit 241e04bc03efb1379622c0c84299e617512973ac)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/82084700
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/82084700
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/82084700

Branch: refs/heads/branch-2.1
Commit: 8208470084153f0be6818f66309f63dcdcb16519
Parents: 6b2301b
Author: Weiqing Yang 
Authored: Wed Nov 16 10:34:56 2016 +
Committer: Sean Owen 
Committed: Wed Nov 16 10:35:05 2016 +

--
 docs/configuration.md | 2 +-
 docs/monitoring.md| 2 +-
 docs/sql-programming-guide.md | 6 +++---
 3 files changed, 5 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/82084700/docs/configuration.md
--
diff --git a/docs/configuration.md b/docs/configuration.md
index d0acd94..e0c6613 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1916,7 +1916,7 @@ showDF(properties, numRows = 200, truncate = FALSE)
   spark.r.heartBeatInterval
   100
   
-Interval for heartbeats sents from SparkR backend to R process to prevent 
connection timeout.
+Interval for heartbeats sent from SparkR backend to R process to prevent 
connection timeout.
   
 
 

http://git-wip-us.apache.org/repos/asf/spark/blob/82084700/docs/monitoring.md
--
diff --git a/docs/monitoring.md b/docs/monitoring.md
index 5bc5e18..2eef456 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -41,7 +41,7 @@ directory must be supplied in the 
`spark.history.fs.logDirectory` configuration
 and should contain sub-directories that each represents an application's event 
logs.
 
 The spark jobs themselves must be configured to log events, and to log them to 
the same shared,
-writeable directory. For example, if the server was configured with a log 
directory of
+writable directory. For example, if the server was configured with a log 
directory of
 `hdfs://namenode/shared/spark-logs`, then the client-side options would be:
 
 ```

http://git-wip-us.apache.org/repos/asf/spark/blob/82084700/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index b9be7a7..ba3e55f 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -222,9 +222,9 @@ The `sql` function enables applications to run SQL queries 
programmatically and
 
 ## Global Temporary View
 
-Temporay views in Spark SQL are session-scoped and will disappear if the 
session that creates it
+Temporary views in Spark SQL are session-scoped and will disappear if the 
session that creates it
 terminates. If you want to have a temporary view that is shared among all 
sessions and keep alive
-until the Spark application terminiates, you can create a global temporary 
view. Global temporary
+until the Spark application terminates, you can create a global temporary 
view. Global temporary
 view is tied to a system preserved database `global_temp`, and we must use the 
qualified name to
 refer it, e.g. `SELECT * FROM global_temp.view1`.
 
@@ -1029,7 +1029,7 @@ following command:
 bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars 
postgresql-9.4.1207.jar
 {% endhighlight %}
 
-Tables from the remote database can be loaded as a DataFrame or Spark SQL 
Temporary table using
+Tables from the remote database can be loaded as a DataFrame or Spark SQL 
temporary view using
 the Data Sources API. Users can specify the JDBC connection properties in the 
data source options.
 user and password are normally provided as 
connection properties for
 logging into the data sources. In addition to the connection properties, Spark 
also supports





spark git commit: [SPARK-18434][ML] Add missing ParamValidations for ML algos

2016-11-16 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 241e04bc0 -> c68f1a38a


[SPARK-18434][ML] Add missing ParamValidations for ML algos

## What changes were proposed in this pull request?
Add missing ParamValidations for ML algos
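As a quick illustration (using a hypothetical `maxBins` param, not one of the params touched by this patch), "adding a ParamValidation" means passing an `isValid` predicate to the param constructor so that out-of-range values are rejected when the param is set, instead of failing later during fit:

```scala
import org.apache.spark.ml.param.{IntParam, ParamValidators, Params}

// Hypothetical shared-param trait; ParamValidators.gtEq(2) rejects any value below 2 at set time.
trait HasMaxBinsSketch extends Params {
  final val maxBins: IntParam = new IntParam(this, "maxBins",
    "maximum number of bins (>= 2)", ParamValidators.gtEq(2))

  /** @group getParam */
  def getMaxBins: Int = $(maxBins)
}
```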
## How was this patch tested?
existing tests

Author: Zheng RuiFeng 

Closes #15881 from zhengruifeng/arg_checking.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c68f1a38
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c68f1a38
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c68f1a38

Branch: refs/heads/master
Commit: c68f1a38af67957ee28889667193da8f64bb4342
Parents: 241e04b
Author: Zheng RuiFeng 
Authored: Wed Nov 16 02:46:27 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Nov 16 02:46:27 2016 -0800

--
 .../main/scala/org/apache/spark/ml/feature/IDF.scala   |  3 ++-
 .../main/scala/org/apache/spark/ml/feature/PCA.scala   |  3 ++-
 .../scala/org/apache/spark/ml/feature/Word2Vec.scala   | 13 -
 .../spark/ml/regression/IsotonicRegression.scala   |  3 ++-
 .../apache/spark/ml/regression/LinearRegression.scala  |  6 +-
 .../scala/org/apache/spark/ml/tree/treeParams.scala|  4 +++-
 6 files changed, 22 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c68f1a38/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
index 6386dd8..46a0730 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
@@ -44,7 +44,8 @@ private[feature] trait IDFBase extends Params with 
HasInputCol with HasOutputCol
* @group param
*/
   final val minDocFreq = new IntParam(
-this, "minDocFreq", "minimum number of documents in which a term should 
appear for filtering")
+this, "minDocFreq", "minimum number of documents in which a term should 
appear for filtering" +
+  " (>= 0)", ParamValidators.gtEq(0))
 
   setDefault(minDocFreq -> 0)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/c68f1a38/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
index 6b91348..444006f 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
@@ -44,7 +44,8 @@ private[feature] trait PCAParams extends Params with 
HasInputCol with HasOutputC
* The number of principal components.
* @group param
*/
-  final val k: IntParam = new IntParam(this, "k", "the number of principal 
components")
+  final val k: IntParam = new IntParam(this, "k", "the number of principal 
components (> 0)",
+ParamValidators.gt(0))
 
   /** @group getParam */
   def getK: Int = $(k)

http://git-wip-us.apache.org/repos/asf/spark/blob/c68f1a38/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
index d53f3df..3ed08c9 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
@@ -43,7 +43,8 @@ private[feature] trait Word2VecBase extends Params
* @group param
*/
   final val vectorSize = new IntParam(
-this, "vectorSize", "the dimension of codes after transforming from words")
+this, "vectorSize", "the dimension of codes after transforming from words 
(> 0)",
+ParamValidators.gt(0))
   setDefault(vectorSize -> 100)
 
   /** @group getParam */
@@ -55,7 +56,8 @@ private[feature] trait Word2VecBase extends Params
* @group expertParam
*/
   final val windowSize = new IntParam(
-this, "windowSize", "the window size (context words from [-window, 
window])")
+this, "windowSize", "the window size (context words from [-window, 
window]) (> 0)",
+ParamValidators.gt(0))
   setDefault(windowSize -> 5)
 
   /** @group expertGetParam */
@@ -67,7 +69,8 @@ private[feature] trait Word2VecBase extends Params
* @group param
*/
   final val numPartitions = new IntParam(
-this, "numPartitions", "number of partitions for sentences of words")
+this, "numPartitions", "number of partitions for sentences of words (> 0)",
+ParamValidators.gt(0))
   setDefault(numPartitions -> 1)
 
   /** @group getParam */
@@ -80,7 +83,7 @@ pri
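The pattern this patch adds is the same in every file touched: the validator is
attached where the parameter is declared, so an out-of-range value is rejected
at `set` time instead of failing later inside the algorithm. A minimal,
self-contained sketch of that behaviour (the `ToyParams` holder below is
hypothetical, not part of the patch; it only assumes spark-mllib on the
classpath):

```scala
import org.apache.spark.ml.param.{IntParam, ParamMap, Params, ParamValidators}
import org.apache.spark.ml.util.Identifiable

// Hypothetical params holder; mirrors how IDF/PCA/Word2Vec declare their IntParams.
class ToyParams(override val uid: String) extends Params {
  def this() = this(Identifiable.randomUID("toy"))

  // Validator attached at declaration time, as in the patch.
  final val k = new IntParam(this, "k", "number of components (> 0)", ParamValidators.gt(0))

  def setK(value: Int): this.type = set(k, value)

  override def copy(extra: ParamMap): ToyParams = defaultCopy(extra)
}

object ToyParamsDemo {
  def main(args: Array[String]): Unit = {
    val p = new ToyParams()
    p.setK(3)        // accepted
    // p.setK(0)     // would now throw IllegalArgumentException instead of failing later
  }
}
```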

spark git commit: [SPARK-18434][ML] Add missing ParamValidations for ML algos

2016-11-16 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 820847008 -> 6b6eb4e52


[SPARK-18434][ML] Add missing ParamValidations for ML algos

## What changes were proposed in this pull request?
Add missing ParamValidations for ML algos
## How was this patch tested?
existing tests

Author: Zheng RuiFeng 

Closes #15881 from zhengruifeng/arg_checking.

(cherry picked from commit c68f1a38af67957ee28889667193da8f64bb4342)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6b6eb4e5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6b6eb4e5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6b6eb4e5

Branch: refs/heads/branch-2.1
Commit: 6b6eb4e520d07a27aa68d3450f3c7613b233d928
Parents: 8208470
Author: Zheng RuiFeng 
Authored: Wed Nov 16 02:46:27 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Nov 16 02:46:54 2016 -0800

--
 .../main/scala/org/apache/spark/ml/feature/IDF.scala   |  3 ++-
 .../main/scala/org/apache/spark/ml/feature/PCA.scala   |  3 ++-
 .../scala/org/apache/spark/ml/feature/Word2Vec.scala   | 13 -
 .../spark/ml/regression/IsotonicRegression.scala   |  3 ++-
 .../apache/spark/ml/regression/LinearRegression.scala  |  6 +-
 .../scala/org/apache/spark/ml/tree/treeParams.scala|  4 +++-
 6 files changed, 22 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6b6eb4e5/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
index 6386dd8..46a0730 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
@@ -44,7 +44,8 @@ private[feature] trait IDFBase extends Params with 
HasInputCol with HasOutputCol
* @group param
*/
   final val minDocFreq = new IntParam(
-this, "minDocFreq", "minimum number of documents in which a term should 
appear for filtering")
+this, "minDocFreq", "minimum number of documents in which a term should 
appear for filtering" +
+  " (>= 0)", ParamValidators.gtEq(0))
 
   setDefault(minDocFreq -> 0)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/6b6eb4e5/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
index 6b91348..444006f 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
@@ -44,7 +44,8 @@ private[feature] trait PCAParams extends Params with 
HasInputCol with HasOutputC
* The number of principal components.
* @group param
*/
-  final val k: IntParam = new IntParam(this, "k", "the number of principal 
components")
+  final val k: IntParam = new IntParam(this, "k", "the number of principal 
components (> 0)",
+ParamValidators.gt(0))
 
   /** @group getParam */
   def getK: Int = $(k)

http://git-wip-us.apache.org/repos/asf/spark/blob/6b6eb4e5/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
index d53f3df..3ed08c9 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
@@ -43,7 +43,8 @@ private[feature] trait Word2VecBase extends Params
* @group param
*/
   final val vectorSize = new IntParam(
-this, "vectorSize", "the dimension of codes after transforming from words")
+this, "vectorSize", "the dimension of codes after transforming from words 
(> 0)",
+ParamValidators.gt(0))
   setDefault(vectorSize -> 100)
 
   /** @group getParam */
@@ -55,7 +56,8 @@ private[feature] trait Word2VecBase extends Params
* @group expertParam
*/
   final val windowSize = new IntParam(
-this, "windowSize", "the window size (context words from [-window, 
window])")
+this, "windowSize", "the window size (context words from [-window, 
window]) (> 0)",
+ParamValidators.gt(0))
   setDefault(windowSize -> 5)
 
   /** @group expertGetParam */
@@ -67,7 +69,8 @@ private[feature] trait Word2VecBase extends Params
* @group param
*/
   final val numPartitions = new IntParam(
-this, "numPartitions", "number of partitions for sentences of words")
+this, "numPartitions", "number of partitions for sentences of words (> 0)",
+

spark git commit: [SPARK-18446][ML][DOCS] Add links to API docs for ML algos

2016-11-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master c68f1a38a -> a75e3fe92


[SPARK-18446][ML][DOCS] Add links to API docs for ML algos

## What changes were proposed in this pull request?
Add links to API docs for ML algos
## How was this patch tested?
Manual checking for the API links

Author: Zheng RuiFeng 

Closes #15890 from zhengruifeng/algo_link.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a75e3fe9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a75e3fe9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a75e3fe9

Branch: refs/heads/master
Commit: a75e3fe923372c56bc1b2f4baeaaf5868ad28341
Parents: c68f1a3
Author: Zheng RuiFeng 
Authored: Wed Nov 16 10:53:23 2016 +
Committer: Sean Owen 
Committed: Wed Nov 16 10:53:23 2016 +

--
 docs/ml-classification-regression.md | 39 +++
 docs/ml-pipeline.md  | 25 
 docs/ml-tuning.md| 17 ++
 3 files changed, 81 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a75e3fe9/docs/ml-classification-regression.md
--
diff --git a/docs/ml-classification-regression.md 
b/docs/ml-classification-regression.md
index b10793d..1aacc3e 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -55,14 +55,23 @@ $\alpha$ and `regParam` corresponds to $\lambda$.
 
 
 
+
+More details on parameters can be found in the [Scala API 
documentation](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression).
+
 {% include_example 
scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala
 %}
 
 
 
+
+More details on parameters can be found in the [Java API 
documentation](api/java/org/apache/spark/ml/classification/LogisticRegression.html).
+
 {% include_example 
java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java
 %}
 
 
 
+
+More details on parameters can be found in the [Python API 
documentation](api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression).
+
 {% include_example python/ml/logistic_regression_with_elastic_net.py %}
 
 
@@ -289,14 +298,23 @@ MLPC employs backpropagation for learning the model. We 
use the logistic loss fu
 
 
 
+
+Refer to the [Scala API 
docs](api/scala/index.html#org.apache.spark.ml.classification.MultilayerPerceptronClassifier)
 for more details.
+
 {% include_example 
scala/org/apache/spark/examples/ml/MultilayerPerceptronClassifierExample.scala 
%}
 
 
 
+
+Refer to the [Java API 
docs](api/java/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html)
 for more details.
+
 {% include_example 
java/org/apache/spark/examples/ml/JavaMultilayerPerceptronClassifierExample.java
 %}
 
 
 
+
+Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classification.MultilayerPerceptronClassifier)
 for more details.
+
 {% include_example python/ml/multilayer_perceptron_classification.py %}
 
 
@@ -392,15 +410,24 @@ regression model and extracting model summary statistics.
 
 
 
+
+More details on parameters can be found in the [Scala API 
documentation](api/scala/index.html#org.apache.spark.ml.regression.LinearRegression).
+
 {% include_example 
scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala 
%}
 
 
 
+
+More details on parameters can be found in the [Java API 
documentation](api/java/org/apache/spark/ml/regression/LinearRegression.html).
+
 {% include_example 
java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java
 %}
 
 
 
 
+
+More details on parameters can be found in the [Python API 
documentation](api/python/pyspark.ml.html#pyspark.ml.regression.LinearRegression).
+
 {% include_example python/ml/linear_regression_with_elastic_net.py %}
 
 
@@ -519,18 +546,21 @@ function and extracting model summary statistics.
 
 
 
+
 Refer to the [Scala API 
docs](api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression)
 for more details.
 
 {% include_example 
scala/org/apache/spark/examples/ml/GeneralizedLinearRegressionExample.scala %}
 
 
 
+
 Refer to the [Java API 
docs](api/java/org/apache/spark/ml/regression/GeneralizedLinearRegression.html) 
for more details.
 
 {% include_example 
java/org/apache/spark/examples/ml/JavaGeneralizedLinearRegressionExample.java %}
 
 
 
+
 Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.regression.GeneralizedLinearRegression)
 for more details.
 
 {% include_example python/ml/generalized_linear_regression_example.py %}
@@ -705,14 +735,23 @@ The implementation matches the result from R's survival 
function
 
 
 
+
+Refer to the [Scala API 
docs](api/scala/index.html#org.a

spark git commit: [SPARK-18446][ML][DOCS] Add links to API docs for ML algos

2016-11-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 6b6eb4e52 -> 416bc3dd3


[SPARK-18446][ML][DOCS] Add links to API docs for ML algos

## What changes were proposed in this pull request?
Add links to API docs for ML algos
## How was this patch tested?
Manual checking for the API links

Author: Zheng RuiFeng 

Closes #15890 from zhengruifeng/algo_link.

(cherry picked from commit a75e3fe923372c56bc1b2f4baeaaf5868ad28341)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/416bc3dd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/416bc3dd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/416bc3dd

Branch: refs/heads/branch-2.1
Commit: 416bc3dd3db7f7ae2cc7b3ffe395decd0c5b73f9
Parents: 6b6eb4e
Author: Zheng RuiFeng 
Authored: Wed Nov 16 10:53:23 2016 +
Committer: Sean Owen 
Committed: Wed Nov 16 10:53:32 2016 +

--
 docs/ml-classification-regression.md | 39 +++
 docs/ml-pipeline.md  | 25 
 docs/ml-tuning.md| 17 ++
 3 files changed, 81 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/416bc3dd/docs/ml-classification-regression.md
--
diff --git a/docs/ml-classification-regression.md 
b/docs/ml-classification-regression.md
index bb2e404..cb2ccbf 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -55,14 +55,23 @@ $\alpha$ and `regParam` corresponds to $\lambda$.
 
 
 
+
+More details on parameters can be found in the [Scala API 
documentation](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression).
+
 {% include_example 
scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala
 %}
 
 
 
+
+More details on parameters can be found in the [Java API 
documentation](api/java/org/apache/spark/ml/classification/LogisticRegression.html).
+
 {% include_example 
java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java
 %}
 
 
 
+
+More details on parameters can be found in the [Python API 
documentation](api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression).
+
 {% include_example python/ml/logistic_regression_with_elastic_net.py %}
 
 
@@ -289,14 +298,23 @@ MLPC employs backpropagation for learning the model. We 
use the logistic loss fu
 
 
 
+
+Refer to the [Scala API 
docs](api/scala/index.html#org.apache.spark.ml.classification.MultilayerPerceptronClassifier)
 for more details.
+
 {% include_example 
scala/org/apache/spark/examples/ml/MultilayerPerceptronClassifierExample.scala 
%}
 
 
 
+
+Refer to the [Java API 
docs](api/java/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html)
 for more details.
+
 {% include_example 
java/org/apache/spark/examples/ml/JavaMultilayerPerceptronClassifierExample.java
 %}
 
 
 
+
+Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classification.MultilayerPerceptronClassifier)
 for more details.
+
 {% include_example python/ml/multilayer_perceptron_classification.py %}
 
 
@@ -392,15 +410,24 @@ regression model and extracting model summary statistics.
 
 
 
+
+More details on parameters can be found in the [Scala API 
documentation](api/scala/index.html#org.apache.spark.ml.regression.LinearRegression).
+
 {% include_example 
scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala 
%}
 
 
 
+
+More details on parameters can be found in the [Java API 
documentation](api/java/org/apache/spark/ml/regression/LinearRegression.html).
+
 {% include_example 
java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java
 %}
 
 
 
 
+
+More details on parameters can be found in the [Python API 
documentation](api/python/pyspark.ml.html#pyspark.ml.regression.LinearRegression).
+
 {% include_example python/ml/linear_regression_with_elastic_net.py %}
 
 
@@ -519,18 +546,21 @@ function and extracting model summary statistics.
 
 
 
+
 Refer to the [Scala API 
docs](api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression)
 for more details.
 
 {% include_example 
scala/org/apache/spark/examples/ml/GeneralizedLinearRegressionExample.scala %}
 
 
 
+
 Refer to the [Java API 
docs](api/java/org/apache/spark/ml/regression/GeneralizedLinearRegression.html) 
for more details.
 
 {% include_example 
java/org/apache/spark/examples/ml/JavaGeneralizedLinearRegressionExample.java %}
 
 
 
+
 Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.regression.GeneralizedLinearRegression)
 for more details.
 
 {% include_example python/ml/generalized_linear_regression_example.py %}
@@ -705,14 +735,23 @@ The implementation matches t

spark git commit: [SPARK-18420][BUILD] Fix the errors caused by lint check in Java

2016-11-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master a75e3fe92 -> 7569cf6cb


[SPARK-18420][BUILD] Fix the errors caused by lint check in Java

## What changes were proposed in this pull request?

Small fix, fix the errors caused by lint check in Java

- Clear unused objects and `UnusedImports`.
- Add comments around the method `finalize` of `NioBufferedFileInputStream` to
turn off checkstyle.
- Cut lines that are longer than 100 characters into two lines.

## How was this patch tested?
Travis CI.
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive 
-Phive-thriftserver install
$ dev/lint-java
```
Before:
```
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] 
(imports) UnusedImports: Unused import - 
org.apache.commons.crypto.cipher.CryptoCipherFactory.
[ERROR] src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] 
(modifier) RedundantModifier: Redundant 'public' modifier.
[ERROR] src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java:[133] 
(coding) NoFinalizer: Avoid using finalizer method.
[ERROR] 
src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71] 
(sizes) LineLength: Line is longer than 100 characters (found 113).
[ERROR] 
src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112]
 (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] 
src/test/java/org/apache/spark/sql/catalyst/expressions/HiveHasherSuite.java:[31,17]
 (modifier) ModifierOrder: 'static' modifier out of order with the JLS 
suggestions.
[ERROR]src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64]
 (sizes) LineLength: Line is longer than 100 characters (found 103).
[ERROR] 
src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] 
(imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors.
[ERROR] 
src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] 
(regexp) RegexpSingleline: No trailing whitespace allowed.
```

After:
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive 
-Phive-thriftserver install
$ dev/lint-java
Using `mvn` from path: 
/home/travis/build/ConeyLiu/spark/build/apache-maven-3.3.9/bin/mvn
Checkstyle checks passed.
```

Author: Xianyang Liu 

Closes #15865 from ConeyLiu/master.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7569cf6c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7569cf6c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7569cf6c

Branch: refs/heads/master
Commit: 7569cf6cb85bda7d0e76d3e75e286d4796e77e08
Parents: a75e3fe
Author: Xianyang Liu 
Authored: Wed Nov 16 11:59:00 2016 +
Committer: Sean Owen 
Committed: Wed Nov 16 11:59:00 2016 +

--
 .../org/apache/spark/network/util/TransportConf.java |  1 -
 .../apache/spark/network/sasl/SparkSaslSuite.java|  2 +-
 .../apache/spark/io/NioBufferedFileInputStream.java  |  2 ++
 dev/checkstyle.xml   | 15 +++
 .../spark/examples/ml/JavaInteractionExample.java|  3 +--
 .../JavaLogisticRegressionWithElasticNetExample.java |  4 ++--
 .../sql/catalyst/expressions/UnsafeArrayData.java|  3 ++-
 .../sql/catalyst/expressions/UnsafeMapData.java  |  3 ++-
 .../sql/catalyst/expressions/HiveHasherSuite.java|  1 -
 9 files changed, 25 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7569cf6c/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
--
diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
 
b/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
index d0d0728..012bb09 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
@@ -18,7 +18,6 @@
 package org.apache.spark.network.util;
 
 import com.google.common.primitives.Ints;
-import org.apache.commons.crypto.cipher.CryptoCipherFactory;
 
 /**
  * A central location that tracks all the settings we expose to users.

http://git-wip-us.apache.org/repos/asf/spark/blob/7569cf6c/common/network-common/src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java
--
diff --git 
a/common/network-common/src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java
 
b/common/network-common/src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java
index 4e6146c..ef2ab34 100644
--- 
a/co

spark git commit: [SPARK-18420][BUILD] Fix the errors caused by lint check in Java

2016-11-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 416bc3dd3 -> b0ae87123


[SPARK-18420][BUILD] Fix the errors caused by lint check in Java

Small fix, fix the errors caused by lint check in Java

- Clear unused objects and `UnusedImports`.
- Add comments around the method `finalize` of `NioBufferedFileInputStream` to
turn off checkstyle.
- Cut lines that are longer than 100 characters into two lines.

Travis CI.
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive 
-Phive-thriftserver install
$ dev/lint-java
```
Before:
```
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] 
(imports) UnusedImports: Unused import - 
org.apache.commons.crypto.cipher.CryptoCipherFactory.
[ERROR] src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] 
(modifier) RedundantModifier: Redundant 'public' modifier.
[ERROR] src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java:[133] 
(coding) NoFinalizer: Avoid using finalizer method.
[ERROR] 
src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71] 
(sizes) LineLength: Line is longer than 100 characters (found 113).
[ERROR] 
src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112]
 (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] 
src/test/java/org/apache/spark/sql/catalyst/expressions/HiveHasherSuite.java:[31,17]
 (modifier) ModifierOrder: 'static' modifier out of order with the JLS 
suggestions.
[ERROR]src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64]
 (sizes) LineLength: Line is longer than 100 characters (found 103).
[ERROR] 
src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] 
(imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors.
[ERROR] 
src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] 
(regexp) RegexpSingleline: No trailing whitespace allowed.
```

After:
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive 
-Phive-thriftserver install
$ dev/lint-java
Using `mvn` from path: 
/home/travis/build/ConeyLiu/spark/build/apache-maven-3.3.9/bin/mvn
Checkstyle checks passed.
```

Author: Xianyang Liu 

Closes #15865 from ConeyLiu/master.

(cherry picked from commit 7569cf6cb85bda7d0e76d3e75e286d4796e77e08)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b0ae8712
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b0ae8712
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b0ae8712

Branch: refs/heads/branch-2.1
Commit: b0ae8712358fc8c07aa5efe4d0bd337e7e452078
Parents: 416bc3d
Author: Xianyang Liu 
Authored: Wed Nov 16 11:59:00 2016 +
Committer: Sean Owen 
Committed: Wed Nov 16 12:45:57 2016 +

--
 .../apache/spark/io/NioBufferedFileInputStream.java  |  2 ++
 dev/checkstyle.xml   | 15 +++
 .../spark/examples/ml/JavaInteractionExample.java|  3 +--
 .../JavaLogisticRegressionWithElasticNetExample.java |  4 ++--
 .../sql/catalyst/expressions/UnsafeArrayData.java|  3 ++-
 .../sql/catalyst/expressions/UnsafeMapData.java  |  3 ++-
 .../sql/catalyst/expressions/HiveHasherSuite.java|  1 -
 7 files changed, 24 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b0ae8712/core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java
--
diff --git 
a/core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java 
b/core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java
index f6d1288..ea5f1a9 100644
--- a/core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java
+++ b/core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java
@@ -130,8 +130,10 @@ public final class NioBufferedFileInputStream extends 
InputStream {
 StorageUtils.dispose(byteBuffer);
   }
 
+  //checkstyle.off: NoFinalizer
   @Override
   protected void finalize() throws IOException {
 close();
   }
+  //checkstyle.on: NoFinalizer
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/b0ae8712/dev/checkstyle.xml
--
diff --git a/dev/checkstyle.xml b/dev/checkstyle.xml
index 3de6aa9..92c5251 100644
--- a/dev/checkstyle.xml
+++ b/dev/checkstyle.xml
@@ -52,6 +52,20 @@
   
 
 
+
+
+
+
+
+
+
 
 
 
@@ -168,5 +182,6 @@
 
 
 
+
 
 

http://git-wip-us.apache.org/repos/asf/spark/blob/b0ae8712/examples/src/main/java/org/apache/spark/examples/ml/JavaInteracti

spark git commit: [SPARK-18430][SQL][BACKPORT-2.0] Fixed Exception Messages when Hitting an Invocation Exception of Function Lookup

2016-11-16 Thread hvanhovell
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 4f3f09696 -> 10b36d62a


[SPARK-18430][SQL][BACKPORT-2.0] Fixed Exception Messages when Hitting an 
Invocation Exception of Function Lookup

### What changes were proposed in this pull request?
This PR is to backport https://github.com/apache/spark/pull/15878

When the exception is an invocation exception during function lookup, we return 
a useless/confusing error message:

For example,
```Scala
df.selectExpr("concat_ws()")
```
Below is the error message we got:
```
null; line 1 pos 0
org.apache.spark.sql.AnalysisException: null; line 1 pos 0
```

To get a meaningful error message, we need to get the cause. The fix is
exactly the same as what we did in https://github.com/apache/spark/pull/12136.
After the fix, the message we get is the exception raised in the constructor of
the function implementation:
```
requirement failed: concat_ws requires at least one argument.; line 1 pos 0
org.apache.spark.sql.AnalysisException: requirement failed: concat_ws requires 
at least one argument.; line 1 pos 0
```
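
The reason unwrapping helps: the registry constructs functions reflectively, and
reflection wraps whatever the constructor throws in an
`InvocationTargetException`, whose own message is typically null. A minimal,
self-contained sketch of that effect (the `ToyConcat` class is hypothetical,
not Spark code):

```scala
import java.lang.reflect.InvocationTargetException
import scala.util.{Failure, Success, Try}

// Hypothetical expression-like class whose constructor validates its arguments,
// the way SQL function constructors do.
class ToyConcat(children: Seq[String]) {
  require(children.nonEmpty, "toy_concat requires at least one argument.")
}

object CauseDemo {
  def main(args: Array[String]): Unit = {
    val ctor = classOf[ToyConcat].getConstructors.head

    Try(ctor.newInstance(Seq.empty[String])) match {
      case Success(_) => println("constructed")
      case Failure(e: InvocationTargetException) =>
        println(s"wrapper message: ${e.getMessage}")          // typically null -- the old, useless error
        println(s"cause message:   ${e.getCause.getMessage}") // requirement failed: toy_concat requires ...
      case Failure(other) => throw other
    }
  }
}
```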

### How was this patch tested?
Added test cases.

Author: gatorsmile 

Closes #15902 from gatorsmile/functionNotFound20.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/10b36d62
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/10b36d62
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/10b36d62

Branch: refs/heads/branch-2.0
Commit: 10b36d62adeb1ef345d2186ee71fec9db1f17d6e
Parents: 4f3f096
Author: gatorsmile 
Authored: Wed Nov 16 06:02:29 2016 -0800
Committer: Herman van Hovell 
Committed: Wed Nov 16 06:02:29 2016 -0800

--
 .../catalyst/analysis/FunctionRegistry.scala|  5 -
 .../sql-tests/inputs/string-functions.sql   |  3 +++
 .../sql-tests/results/string-functions.sql.out  | 20 
 3 files changed, 27 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/10b36d62/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
index 35fd800..73884d7 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -445,7 +445,10 @@ object FunctionRegistry {
 // If there is an apply method that accepts Seq[Expression], use that 
one.
 Try(varargCtor.get.newInstance(expressions).asInstanceOf[Expression]) 
match {
   case Success(e) => e
-  case Failure(e) => throw new AnalysisException(e.getMessage)
+  case Failure(e) =>
+// the exception is an invocation exception. To get a meaningful 
message, we need the
+// cause.
+throw new AnalysisException(e.getCause.getMessage)
 }
   } else {
 // Otherwise, find a constructor method that matches the number of 
arguments, and use that.

http://git-wip-us.apache.org/repos/asf/spark/blob/10b36d62/sql/core/src/test/resources/sql-tests/inputs/string-functions.sql
--
diff --git a/sql/core/src/test/resources/sql-tests/inputs/string-functions.sql 
b/sql/core/src/test/resources/sql-tests/inputs/string-functions.sql
new file mode 100644
index 000..21a0aa6
--- /dev/null
+++ b/sql/core/src/test/resources/sql-tests/inputs/string-functions.sql
@@ -0,0 +1,3 @@
+-- Argument number exception
+select concat_ws();
+select format_string();
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/spark/blob/10b36d62/sql/core/src/test/resources/sql-tests/results/string-functions.sql.out
--
diff --git 
a/sql/core/src/test/resources/sql-tests/results/string-functions.sql.out 
b/sql/core/src/test/resources/sql-tests/results/string-functions.sql.out
new file mode 100644
index 000..6961e9b
--- /dev/null
+++ b/sql/core/src/test/resources/sql-tests/results/string-functions.sql.out
@@ -0,0 +1,20 @@
+-- Automatically generated by SQLQueryTestSuite
+-- Number of queries: 2
+
+
+-- !query 0
+select concat_ws()
+-- !query 0 schema
+struct<>
+-- !query 0 output
+org.apache.spark.sql.AnalysisException
+requirement failed: concat_ws requires at least one argument.; line 1 pos 7
+
+
+-- !query 1
+select format_string()
+-- !query 1 schema
+struct<>
+-- !query 1 output
+org.apache.spark.sql.AnalysisException
+requirement failed: format_string() should take at least 1 argument; line 1 
pos 7



spark git commit: [SPARK-18415][SQL] Weird Plan Output when CTE used in RunnableCommand

2016-11-16 Thread hvanhovell
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 b0ae87123 -> c0dbe08d6


[SPARK-18415][SQL] Weird Plan Output when CTE used in RunnableCommand

### What changes were proposed in this pull request?
Currently, when a CTE is used in a RunnableCommand, the Analyzer does not
replace the logical node `With`, and the child plan of the RunnableCommand is
not resolved. Thus, the output of the `With` plan node looks very confusing.
For example,
```
sql(
  """
|CREATE VIEW cte_view AS
|WITH w AS (SELECT 1 AS n), cte1 (select 2), cte2 as (select 3)
|SELECT n FROM w
  """.stripMargin).explain()
```
The output is like
```
ExecutedCommand
   +- CreateViewCommand `cte_view`, WITH w AS (SELECT 1 AS n), cte1 (select 2), 
cte2 as (select 3)
SELECT n FROM w, false, false, PersistedView
 +- 'With [(w,SubqueryAlias w
+- Project [1 AS n#16]
   +- OneRowRelation$
), (cte1,'SubqueryAlias cte1
+- 'Project [unresolvedalias(2, None)]
   +- OneRowRelation$
), (cte2,'SubqueryAlias cte2
+- 'Project [unresolvedalias(3, None)]
   +- OneRowRelation$
)]
+- 'Project ['n]
   +- 'UnresolvedRelation `w`
```
After the fix, the output is as shown below.
```
ExecutedCommand
   +- CreateViewCommand `cte_view`, WITH w AS (SELECT 1 AS n), cte1 (select 2), 
cte2 as (select 3)
SELECT n FROM w, false, false, PersistedView
 +- CTE [w, cte1, cte2]
:  :- SubqueryAlias w
:  :  +- Project [1 AS n#16]
:  : +- OneRowRelation$
:  :- 'SubqueryAlias cte1
:  :  +- 'Project [unresolvedalias(2, None)]
:  : +- OneRowRelation$
:  +- 'SubqueryAlias cte2
: +- 'Project [unresolvedalias(3, None)]
:+- OneRowRelation$
+- 'Project ['n]
   +- 'UnresolvedRelation `w`
```
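
The readable header line in the output above comes from the new `simpleString`
override; here is a tiny, dependency-free sketch of the formatting it produces,
with `mkString` standing in for Spark's `Utils.truncatedString` (which
additionally caps very long lists):

```scala
object CteHeaderSketch {
  def main(args: Array[String]): Unit = {
    // Alias names as they would appear in With.cteRelations.map(_._1).
    val cteAliases = Seq("w", "cte1", "cte2")
    val simpleString = cteAliases.mkString("CTE [", ", ", "]")
    println(simpleString)  // prints: CTE [w, cte1, cte2]
  }
}
```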

BTW, this PR also fixes the output of the view type.

### How was this patch tested?
Manual

Author: gatorsmile 

Closes #15854 from gatorsmile/cteName.

(cherry picked from commit 608ecc512b759514c75a1b475582f237ed569f10)
Signed-off-by: Herman van Hovell 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c0dbe08d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c0dbe08d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c0dbe08d

Branch: refs/heads/branch-2.1
Commit: c0dbe08d604dea543eb17ccb802a8a20d6c21a69
Parents: b0ae871
Author: gatorsmile 
Authored: Wed Nov 16 08:25:15 2016 -0800
Committer: Herman van Hovell 
Committed: Wed Nov 16 08:25:29 2016 -0800

--
 .../sql/catalyst/plans/logical/basicLogicalOperators.scala   | 8 
 .../scala/org/apache/spark/sql/execution/command/views.scala | 4 +++-
 2 files changed, 11 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c0dbe08d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
index 574caf0..dd6c8fd 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
@@ -26,6 +26,7 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
 
 /**
  * When planning take() or collect() operations, this special node that is 
inserted at the top of
@@ -405,6 +406,13 @@ case class InsertIntoTable(
  */
 case class With(child: LogicalPlan, cteRelations: Seq[(String, 
SubqueryAlias)]) extends UnaryNode {
   override def output: Seq[Attribute] = child.output
+
+  override def simpleString: String = {
+val cteAliases = Utils.truncatedString(cteRelations.map(_._1), "[", ", ", 
"]")
+s"CTE $cteAliases"
+  }
+
+  override def innerChildren: Seq[QueryPlan[_]] = cteRelations.map(_._2)
 }
 
 case class WithWindowDefinition(

http://git-wip-us.apache.org/repos/asf/spark/blob/c0dbe08d/sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala
index 30472ec..154141b 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala
+++ b/sql/core/src/main/scala/org/apache/sp

spark git commit: [SPARK-18415][SQL] Weird Plan Output when CTE used in RunnableCommand

2016-11-16 Thread hvanhovell
Repository: spark
Updated Branches:
  refs/heads/master 7569cf6cb -> 608ecc512


[SPARK-18415][SQL] Weird Plan Output when CTE used in RunnableCommand

### What changes were proposed in this pull request?
Currently, when a CTE is used in a RunnableCommand, the Analyzer does not
replace the logical node `With`, and the child plan of the RunnableCommand is
not resolved. Thus, the output of the `With` plan node looks very confusing.
For example,
```
sql(
  """
|CREATE VIEW cte_view AS
|WITH w AS (SELECT 1 AS n), cte1 (select 2), cte2 as (select 3)
|SELECT n FROM w
  """.stripMargin).explain()
```
The output is like
```
ExecutedCommand
   +- CreateViewCommand `cte_view`, WITH w AS (SELECT 1 AS n), cte1 (select 2), 
cte2 as (select 3)
SELECT n FROM w, false, false, PersistedView
 +- 'With [(w,SubqueryAlias w
+- Project [1 AS n#16]
   +- OneRowRelation$
), (cte1,'SubqueryAlias cte1
+- 'Project [unresolvedalias(2, None)]
   +- OneRowRelation$
), (cte2,'SubqueryAlias cte2
+- 'Project [unresolvedalias(3, None)]
   +- OneRowRelation$
)]
+- 'Project ['n]
   +- 'UnresolvedRelation `w`
```
After the fix, the output is as shown below.
```
ExecutedCommand
   +- CreateViewCommand `cte_view`, WITH w AS (SELECT 1 AS n), cte1 (select 2), 
cte2 as (select 3)
SELECT n FROM w, false, false, PersistedView
 +- CTE [w, cte1, cte2]
:  :- SubqueryAlias w
:  :  +- Project [1 AS n#16]
:  : +- OneRowRelation$
:  :- 'SubqueryAlias cte1
:  :  +- 'Project [unresolvedalias(2, None)]
:  : +- OneRowRelation$
:  +- 'SubqueryAlias cte2
: +- 'Project [unresolvedalias(3, None)]
:+- OneRowRelation$
+- 'Project ['n]
   +- 'UnresolvedRelation `w`
```

BTW, this PR also fixes the output of the view type.

### How was this patch tested?
Manual

Author: gatorsmile 

Closes #15854 from gatorsmile/cteName.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/608ecc51
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/608ecc51
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/608ecc51

Branch: refs/heads/master
Commit: 608ecc512b759514c75a1b475582f237ed569f10
Parents: 7569cf6
Author: gatorsmile 
Authored: Wed Nov 16 08:25:15 2016 -0800
Committer: Herman van Hovell 
Committed: Wed Nov 16 08:25:15 2016 -0800

--
 .../sql/catalyst/plans/logical/basicLogicalOperators.scala   | 8 
 .../scala/org/apache/spark/sql/execution/command/views.scala | 4 +++-
 2 files changed, 11 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/608ecc51/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
index 4dcc288..4e333d5 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
@@ -25,6 +25,7 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
 
 /**
  * When planning take() or collect() operations, this special node that is 
inserted at the top of
@@ -404,6 +405,13 @@ case class InsertIntoTable(
  */
 case class With(child: LogicalPlan, cteRelations: Seq[(String, 
SubqueryAlias)]) extends UnaryNode {
   override def output: Seq[Attribute] = child.output
+
+  override def simpleString: String = {
+val cteAliases = Utils.truncatedString(cteRelations.map(_._1), "[", ", ", 
"]")
+s"CTE $cteAliases"
+  }
+
+  override def innerChildren: Seq[QueryPlan[_]] = cteRelations.map(_._2)
 }
 
 case class WithWindowDefinition(

http://git-wip-us.apache.org/repos/asf/spark/blob/608ecc51/sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala
index 30472ec..154141b 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala
@@ -33,7 +33,9 @@ import org.apache.spark.sql.types.MetadataBuilder
  * Vi

spark git commit: [SPARK-18459][SPARK-18460][STRUCTUREDSTREAMING] Rename triggerId to batchId and add triggerDetails to json in StreamingQueryStatus

2016-11-16 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 608ecc512 -> 0048ce7ce


[SPARK-18459][SPARK-18460][STRUCTUREDSTREAMING] Rename triggerId to batchId and 
add triggerDetails to json in StreamingQueryStatus

## What changes were proposed in this pull request?

SPARK-18459: triggerId sounds like a number that should increase with each
trigger, whether or not there is data in it. In reality, however, triggerId
increases only when there is a batch of data in a trigger, so it's better to
rename it to batchId.

SPARK-18460: triggerDetails was missing from json representation. Fixed it.

## How was this patch tested?
Updated existing unit tests.

Author: Tathagata Das 

Closes #15895 from tdas/SPARK-18459.
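
For readers scanning these status maps, the renamed key and its neighbours as
they now appear in `triggerDetails`, shown as a plain Scala map. The values are
copied from the doctest changed in this patch; this is illustrative data, not an
API call:

```scala
object TriggerDetailsSketch {
  def main(args: Array[String]): Unit = {
    val triggerDetails = Map(
      "batchId" -> "5",                 // renamed from "triggerId" by this change
      "isTriggerActive" -> "true",
      "isDataPresentInTrigger" -> "true",
      "latency.getOffset.total" -> "10",
      "latency.getBatch.total" -> "20",
      "numRows.input.total" -> "100"
    )
    // SPARK-18460 also surfaces these details in the status JSON; the hand-rolled
    // rendering below only shows the shape of that object.
    val json = triggerDetails
      .map { case (k, v) => "\"" + k + "\": \"" + v + "\"" }
      .mkString("{", ", ", "}")
    println(json)
  }
}
```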


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0048ce7c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0048ce7c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0048ce7c

Branch: refs/heads/master
Commit: 0048ce7ce64b02cbb6a1c4a2963a0b1b9541047e
Parents: 608ecc5
Author: Tathagata Das 
Authored: Wed Nov 16 10:00:59 2016 -0800
Committer: Shixiong Zhu 
Committed: Wed Nov 16 10:00:59 2016 -0800

--
 python/pyspark/sql/streaming.py |  6 +++---
 .../sql/execution/streaming/StreamMetrics.scala |  8 +++
 .../sql/streaming/StreamingQueryStatus.scala|  4 ++--
 .../streaming/StreamMetricsSuite.scala  |  8 +++
 .../streaming/StreamingQueryListenerSuite.scala |  4 ++--
 .../streaming/StreamingQueryStatusSuite.scala   | 22 ++--
 6 files changed, 35 insertions(+), 17 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0048ce7c/python/pyspark/sql/streaming.py
--
diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py
index f326f16..0e4589b 100644
--- a/python/pyspark/sql/streaming.py
+++ b/python/pyspark/sql/streaming.py
@@ -212,12 +212,12 @@ class StreamingQueryStatus(object):
 Processing rate 23.5 rows/sec
 Latency: 345.0 ms
 Trigger details:
+batchId: 5
 isDataPresentInTrigger: true
 isTriggerActive: true
 latency.getBatch.total: 20
 latency.getOffset.total: 10
 numRows.input.total: 100
-triggerId: 5
 Source statuses [1 source]:
 Source 1 - MySource1
 Available offset: 0
@@ -341,8 +341,8 @@ class StreamingQueryStatus(object):
 If no trigger is currently active, then it will have details of the 
last completed trigger.
 
 >>> sqs.triggerDetails
-{u'triggerId': u'5', u'latency.getBatch.total': u'20', 
u'numRows.input.total': u'100',
-u'isTriggerActive': u'true', u'latency.getOffset.total': u'10',
+{u'latency.getBatch.total': u'20', u'numRows.input.total': u'100',
+u'isTriggerActive': u'true', u'batchId': u'5', 
u'latency.getOffset.total': u'10',
 u'isDataPresentInTrigger': u'true'}
 """
 return self._jsqs.triggerDetails()

http://git-wip-us.apache.org/repos/asf/spark/blob/0048ce7c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
index 5645554..942e6ed 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
@@ -78,13 +78,13 @@ class StreamMetrics(sources: Set[Source], triggerClock: 
Clock, codahaleSourceNam
 
   // === Setter methods ===
 
-  def reportTriggerStarted(triggerId: Long): Unit = synchronized {
+  def reportTriggerStarted(batchId: Long): Unit = synchronized {
 numInputRows.clear()
 triggerDetails.clear()
 sourceTriggerDetails.values.foreach(_.clear())
 
-reportTriggerDetail(TRIGGER_ID, triggerId)
-sources.foreach(s => reportSourceTriggerDetail(s, TRIGGER_ID, triggerId))
+reportTriggerDetail(BATCH_ID, batchId)
+sources.foreach(s => reportSourceTriggerDetail(s, BATCH_ID, batchId))
 reportTriggerDetail(IS_TRIGGER_ACTIVE, true)
 currentTriggerStartTimestamp = triggerClock.getTimeMillis()
 reportTriggerDetail(START_TIMESTAMP, currentTriggerStartTimestamp)
@@ -217,7 +217,7 @@ object StreamMetrics extends Logging {
   }
 
 
-  val TRIGGER_ID = "triggerId"
+  val BATCH_ID = "batchId"
   val IS_TRIGGER_ACTIVE = "isTriggerActive"
   val IS_DATA_PRESENT_IN_TRIGGER = "isDataPresentInTrigger"
   val

spark git commit: [SPARK-18459][SPARK-18460][STRUCTUREDSTREAMING] Rename triggerId to batchId and add triggerDetails to json in StreamingQueryStatus

2016-11-16 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 c0dbe08d6 -> b86e962c9


[SPARK-18459][SPARK-18460][STRUCTUREDSTREAMING] Rename triggerId to batchId and 
add triggerDetails to json in StreamingQueryStatus

## What changes were proposed in this pull request?

SPARK-18459: triggerId sounds like a number that should increase with each
trigger, whether or not there is data in it. In reality, however, triggerId
increases only when there is a batch of data in a trigger, so it's better to
rename it to batchId.

SPARK-18460: triggerDetails was missing from json representation. Fixed it.

## How was this patch tested?
Updated existing unit tests.

Author: Tathagata Das 

Closes #15895 from tdas/SPARK-18459.

(cherry picked from commit 0048ce7ce64b02cbb6a1c4a2963a0b1b9541047e)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b86e962c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b86e962c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b86e962c

Branch: refs/heads/branch-2.1
Commit: b86e962c90c4322cd98b5bf3b19e251da2d32442
Parents: c0dbe08
Author: Tathagata Das 
Authored: Wed Nov 16 10:00:59 2016 -0800
Committer: Shixiong Zhu 
Committed: Wed Nov 16 10:01:08 2016 -0800

--
 python/pyspark/sql/streaming.py |  6 +++---
 .../sql/execution/streaming/StreamMetrics.scala |  8 +++
 .../sql/streaming/StreamingQueryStatus.scala|  4 ++--
 .../streaming/StreamMetricsSuite.scala  |  8 +++
 .../streaming/StreamingQueryListenerSuite.scala |  4 ++--
 .../streaming/StreamingQueryStatusSuite.scala   | 22 ++--
 6 files changed, 35 insertions(+), 17 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b86e962c/python/pyspark/sql/streaming.py
--
diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py
index f326f16..0e4589b 100644
--- a/python/pyspark/sql/streaming.py
+++ b/python/pyspark/sql/streaming.py
@@ -212,12 +212,12 @@ class StreamingQueryStatus(object):
 Processing rate 23.5 rows/sec
 Latency: 345.0 ms
 Trigger details:
+batchId: 5
 isDataPresentInTrigger: true
 isTriggerActive: true
 latency.getBatch.total: 20
 latency.getOffset.total: 10
 numRows.input.total: 100
-triggerId: 5
 Source statuses [1 source]:
 Source 1 - MySource1
 Available offset: 0
@@ -341,8 +341,8 @@ class StreamingQueryStatus(object):
 If no trigger is currently active, then it will have details of the 
last completed trigger.
 
 >>> sqs.triggerDetails
-{u'triggerId': u'5', u'latency.getBatch.total': u'20', 
u'numRows.input.total': u'100',
-u'isTriggerActive': u'true', u'latency.getOffset.total': u'10',
+{u'latency.getBatch.total': u'20', u'numRows.input.total': u'100',
+u'isTriggerActive': u'true', u'batchId': u'5', 
u'latency.getOffset.total': u'10',
 u'isDataPresentInTrigger': u'true'}
 """
 return self._jsqs.triggerDetails()

http://git-wip-us.apache.org/repos/asf/spark/blob/b86e962c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
index 5645554..942e6ed 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
@@ -78,13 +78,13 @@ class StreamMetrics(sources: Set[Source], triggerClock: 
Clock, codahaleSourceNam
 
   // === Setter methods ===
 
-  def reportTriggerStarted(triggerId: Long): Unit = synchronized {
+  def reportTriggerStarted(batchId: Long): Unit = synchronized {
 numInputRows.clear()
 triggerDetails.clear()
 sourceTriggerDetails.values.foreach(_.clear())
 
-reportTriggerDetail(TRIGGER_ID, triggerId)
-sources.foreach(s => reportSourceTriggerDetail(s, TRIGGER_ID, triggerId))
+reportTriggerDetail(BATCH_ID, batchId)
+sources.foreach(s => reportSourceTriggerDetail(s, BATCH_ID, batchId))
 reportTriggerDetail(IS_TRIGGER_ACTIVE, true)
 currentTriggerStartTimestamp = triggerClock.getTimeMillis()
 reportTriggerDetail(START_TIMESTAMP, currentTriggerStartTimestamp)
@@ -217,7 +217,7 @@ object StreamMetrics extends Logging {
   }
 
 
-  val TRIGGER_ID = "triggerId"
+  val BATCH_ID = "batchId"
   va

spark git commit: [SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

2016-11-16 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 b86e962c9 -> 3d4756d56


[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about 
monitoring streaming queries

## What changes were proposed in this pull request?
Screenshots:
https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png
https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png

Author: Tathagata Das 

Closes #15897 from tdas/SPARK-18461.

(cherry picked from commit bb6cdfd9a6a6b6c91aada7c3174436146045ed1e)
Signed-off-by: Michael Armbrust 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3d4756d5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3d4756d5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3d4756d5

Branch: refs/heads/branch-2.1
Commit: 3d4756d56b852dcf4e1bebe621d4a30570873c3c
Parents: b86e962
Author: Tathagata Das 
Authored: Wed Nov 16 11:03:10 2016 -0800
Committer: Michael Armbrust 
Committed: Wed Nov 16 11:03:19 2016 -0800

--
 docs/structured-streaming-programming-guide.md | 182 +++-
 1 file changed, 179 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3d4756d5/docs/structured-streaming-programming-guide.md
--
diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index d254558..77b66b3 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -1087,9 +1087,185 @@ spark.streams().awaitAnyTermination()  # block until 
any one of them terminates
 
 
 
-Finally, for asynchronous monitoring of streaming queries, you can create and 
attach a `StreamingQueryListener`
-([Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryListener)/[Java](api/java/org/apache/spark/sql/streaming/StreamingQueryListener.html)
 docs),
-which will give you regular callback-based updates when queries are started 
and terminated.
+
+## Monitoring Streaming Queries
+There are two ways you can monitor queries. You can directly get the current 
status
+of an active query using `streamingQuery.status`, which will return a 
`StreamingQueryStatus` object
+([Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryStatus)/[Java](api/java/org/apache/spark/sql/streaming/StreamingQueryStatus.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.StreamingQueryStatus)
 docs)
+that has all the details like current ingestion rates, processing rates, 
average latency,
+details of the currently active trigger, etc.
+
+
+
+
+{% highlight scala %}
+val query: StreamingQuery = ...
+
+println(query.status)
+
+/* Will print the current status of the query
+
+Status of query 'queryName'
+Query id: 1
+Status timestamp: 123
+Input rate: 15.5 rows/sec
+Processing rate 23.5 rows/sec
+Latency: 345.0 ms
+Trigger details:
+batchId: 5
+isDataPresentInTrigger: true
+isTriggerActive: true
+latency.getBatch.total: 20
+latency.getOffset.total: 10
+numRows.input.total: 100
+Source statuses [1 source]:
+Source 1 - MySource1
+Available offset: 0
+Input rate: 15.5 rows/sec
+Processing rate: 23.5 rows/sec
+Trigger details:
+numRows.input.source: 100
+latency.getOffset.source: 10
+latency.getBatch.source: 20
+Sink status - MySink
+Committed offsets: [1, -]
+*/
+{% endhighlight %}
+
+
+
+
+{% highlight java %}
+StreamingQuery query = ...
+
+System.out.println(query.status);
+
+/* Will print the current status of the query
+
+Status of query 'queryName'
+Query id: 1
+Status timestamp: 123
+Input rate: 15.5 rows/sec
+Processing rate 23.5 rows/sec
+Latency: 345.0 ms
+Trigger details:
+batchId: 5
+isDataPresentInTrigger: true
+isTriggerActive: true
+latency.getBatch.total: 20
+latency.getOffset.total: 10
+numRows.input.total: 100
+Source statuses [1 source]:
+Source 1 - MySource1
+Available offset: 0
+Input rate: 15.5 rows/sec
+Processing rate: 23.5 rows/sec
+Trigger details:
+numRows.input.source: 100
+latency.getOffset.source: 10
+latency.getBatch.source: 20
+Sink status - MySink
+Committed offsets: [1, -]
+*/
+{% endhighlight %}
+
+
+
+
+{% highlight python %}
+query = ...  // a StreamingQuery
+
+print(query.status)
+
+'''
+Will print the current status of the query
+
+Status of query 'queryName'
+Query id: 1
+Status times

spark git commit: [SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

2016-11-16 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/master 0048ce7ce -> bb6cdfd9a


[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about 
monitoring streaming queries

## What changes were proposed in this pull request?
Screenshots:
https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png
https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png

Author: Tathagata Das 

Closes #15897 from tdas/SPARK-18461.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bb6cdfd9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bb6cdfd9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bb6cdfd9

Branch: refs/heads/master
Commit: bb6cdfd9a6a6b6c91aada7c3174436146045ed1e
Parents: 0048ce7
Author: Tathagata Das 
Authored: Wed Nov 16 11:03:10 2016 -0800
Committer: Michael Armbrust 
Committed: Wed Nov 16 11:03:10 2016 -0800

--
 docs/structured-streaming-programming-guide.md | 182 +++-
 1 file changed, 179 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bb6cdfd9/docs/structured-streaming-programming-guide.md
--
diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index d254558..77b66b3 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -1087,9 +1087,185 @@ spark.streams().awaitAnyTermination()  # block until 
any one of them terminates
 
 
 
-Finally, for asynchronous monitoring of streaming queries, you can create and 
attach a `StreamingQueryListener`
-([Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryListener)/[Java](api/java/org/apache/spark/sql/streaming/StreamingQueryListener.html)
 docs),
-which will give you regular callback-based updates when queries are started 
and terminated.
+
+## Monitoring Streaming Queries
+There are two ways you can monitor queries. You can directly get the current 
status
+of an active query using `streamingQuery.status`, which will return a 
`StreamingQueryStatus` object
+([Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryStatus)/[Java](api/java/org/apache/spark/sql/streaming/StreamingQueryStatus.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.StreamingQueryStatus)
 docs)
+that has all the details like current ingestion rates, processing rates, 
average latency,
+details of the currently active trigger, etc.
+
+
+
+
+{% highlight scala %}
+val query: StreamingQuery = ...
+
+println(query.status)
+
+/* Will print the current status of the query
+
+Status of query 'queryName'
+Query id: 1
+Status timestamp: 123
+Input rate: 15.5 rows/sec
+Processing rate 23.5 rows/sec
+Latency: 345.0 ms
+Trigger details:
+batchId: 5
+isDataPresentInTrigger: true
+isTriggerActive: true
+latency.getBatch.total: 20
+latency.getOffset.total: 10
+numRows.input.total: 100
+Source statuses [1 source]:
+Source 1 - MySource1
+Available offset: 0
+Input rate: 15.5 rows/sec
+Processing rate: 23.5 rows/sec
+Trigger details:
+numRows.input.source: 100
+latency.getOffset.source: 10
+latency.getBatch.source: 20
+Sink status - MySink
+Committed offsets: [1, -]
+*/
+{% endhighlight %}
+
+
+
+
+{% highlight java %}
+StreamingQuery query = ...
+
+System.out.println(query.status);
+
+/* Will print the current status of the query
+
+Status of query 'queryName'
+Query id: 1
+Status timestamp: 123
+Input rate: 15.5 rows/sec
+Processing rate 23.5 rows/sec
+Latency: 345.0 ms
+Trigger details:
+batchId: 5
+isDataPresentInTrigger: true
+isTriggerActive: true
+latency.getBatch.total: 20
+latency.getOffset.total: 10
+numRows.input.total: 100
+Source statuses [1 source]:
+Source 1 - MySource1
+Available offset: 0
+Input rate: 15.5 rows/sec
+Processing rate: 23.5 rows/sec
+Trigger details:
+numRows.input.source: 100
+latency.getOffset.source: 10
+latency.getBatch.source: 20
+Sink status - MySink
+Committed offsets: [1, -]
+*/
+{% endhighlight %}
+
+
+
+
+{% highlight python %}
+query = ...  # a StreamingQuery
+
+print(query.status)
+
+'''
+Will print the current status of the query
+
+Status of query 'queryName'
+Query id: 1
+Status timestamp: 123
+Input rate: 15.5 rows/sec
+Processing rate 23.5 rows/sec
+Latency: 345.0 ms
+Trigger

spark git commit: [SPARK-18459][SPARK-18460][STRUCTUREDSTREAMING] Rename triggerId to batchId and add triggerDetails to json in StreamingQueryStatus (for branch-2.0)

2016-11-16 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 10b36d62a -> 37e6d9930


[SPARK-18459][SPARK-18460][STRUCTUREDSTREAMING] Rename triggerId to batchId and 
add triggerDetails to json in StreamingQueryStatus (for branch-2.0)

This is a fix for branch-2.0 for the earlier PR #15895

## What changes were proposed in this pull request?

SPARK-18459: triggerId looks like a number that should increase with every trigger, whether or not the trigger contains data. In reality, triggerId increases only when a trigger has a batch of data, so it is clearer to rename it to batchId.

SPARK-18460: triggerDetails was missing from the JSON representation. Fixed it.
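
For illustration, a minimal PySpark sketch of the renamed key (a sketch only: it assumes an active streaming query whose `status` returns the `StreamingQueryStatus` described above, and the printed values are hypothetical):

```python
query = ...  # an active StreamingQuery
details = query.status.triggerDetails  # per-trigger metrics, now keyed by 'batchId'

print(details.get(u'batchId'))    # e.g. u'5'
print(u'triggerId' in details)    # False -- the old key is gone
```
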
## How was this patch tested?

Updated tests

Author: Tathagata Das 

Closes #15908 from tdas/SPARK-18459-2.0.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/37e6d993
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/37e6d993
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/37e6d993

Branch: refs/heads/branch-2.0
Commit: 37e6d9930a6c9f2a4b8b262d5822e24792b6d2f3
Parents: 10b36d6
Author: Tathagata Das 
Authored: Wed Nov 16 13:20:52 2016 -0800
Committer: Shixiong Zhu 
Committed: Wed Nov 16 13:20:52 2016 -0800

--
 python/pyspark/sql/streaming.py |  6 +++---
 .../sql/execution/streaming/StreamMetrics.scala |  8 +++
 .../sql/streaming/StreamingQueryStatus.scala|  4 ++--
 .../streaming/StreamMetricsSuite.scala  |  8 +++
 .../streaming/StreamingQueryListenerSuite.scala |  4 ++--
 .../streaming/StreamingQueryStatusSuite.scala   | 22 ++--
 6 files changed, 35 insertions(+), 17 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/37e6d993/python/pyspark/sql/streaming.py
--
diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py
index cfe917b..be6a9c6 100644
--- a/python/pyspark/sql/streaming.py
+++ b/python/pyspark/sql/streaming.py
@@ -212,12 +212,12 @@ class StreamingQueryStatus(object):
 Processing rate 23.5 rows/sec
 Latency: 345.0 ms
 Trigger details:
+batchId: 5
 isDataPresentInTrigger: true
 isTriggerActive: true
 latency.getBatch.total: 20
 latency.getOffset.total: 10
 numRows.input.total: 100
-triggerId: 5
 Source statuses [1 source]:
 Source 1 - MySource1
 Available offset: #0
@@ -341,8 +341,8 @@ class StreamingQueryStatus(object):
 If no trigger is currently active, then it will have details of the 
last completed trigger.
 
 >>> sqs.triggerDetails
-{u'triggerId': u'5', u'latency.getBatch.total': u'20', 
u'numRows.input.total': u'100',
-u'isTriggerActive': u'true', u'latency.getOffset.total': u'10',
+{u'latency.getBatch.total': u'20', u'numRows.input.total': u'100',
+u'isTriggerActive': u'true', u'batchId': u'5', 
u'latency.getOffset.total': u'10',
 u'isDataPresentInTrigger': u'true'}
 """
 return self._jsqs.triggerDetails()

http://git-wip-us.apache.org/repos/asf/spark/blob/37e6d993/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
index e98d188..aebae31 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
@@ -78,13 +78,13 @@ class StreamMetrics(sources: Set[Source], triggerClock: 
Clock, codahaleSourceNam
 
   // === Setter methods ===
 
-  def reportTriggerStarted(triggerId: Long): Unit = synchronized {
+  def reportTriggerStarted(batchId: Long): Unit = synchronized {
 numInputRows.clear()
 triggerDetails.clear()
 sourceTriggerDetails.values.foreach(_.clear())
 
-reportTriggerDetail(TRIGGER_ID, triggerId)
-sources.foreach(s => reportSourceTriggerDetail(s, TRIGGER_ID, triggerId))
+reportTriggerDetail(BATCH_ID, batchId)
+sources.foreach(s => reportSourceTriggerDetail(s, BATCH_ID, batchId))
 reportTriggerDetail(IS_TRIGGER_ACTIVE, true)
 currentTriggerStartTimestamp = triggerClock.getTimeMillis()
 reportTriggerDetail(START_TIMESTAMP, currentTriggerStartTimestamp)
@@ -217,7 +217,7 @@ object StreamMetrics extends Logging {
   }
 
 
-  val TRIGGER_ID = "triggerId"
+  val BATCH_ID = "batchId"
   val IS_TRIGGER_ACTIVE = "isTriggerActi

spark git commit: [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed

2016-11-16 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/master bb6cdfd9a -> a36a76ac4


[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed

## What changes were proposed in this pull request?

This PR aims to provide a pip-installable PySpark package. It does the work needed to copy the jars over and package them with the Python code, preventing the problems that come from mixing one version of the Python code with a different version of the JARs. It does not currently publish to PyPI, but that is the natural follow-up (SPARK-18129).
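
As a rough sketch of the intended end-user experience (assumptions: the sdist built by this change has already been pip-installed into the current Python environment, and only local mode is used):

```python
# Smoke test for a pip-installed PySpark: importing the package should work
# without setting SPARK_HOME by hand, because the packaged find_spark_home
# logic locates the bundled jars.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")
         .appName("pip-install-smoke-test")
         .getOrCreate())

print(spark.range(10).count())   # expect 10
spark.stop()
```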

Done:
- pip installable on conda [manually tested]
- installed via setup.py on a non-pip-managed system (RHEL) with YARN [manually tested]
- Automated testing of this (virtualenv)
- packaging and signing with release-build*

Possible follow up work:
- release-build update to publish to PyPI (SPARK-18128)
- figure out who owns the pyspark package name on prod PyPI (is it someone within the project, should we ask PyPI, or should we choose a different name to publish under, like ApachePySpark?)
- Windows support and/or testing (SPARK-18136)
- investigate details of wheel caching and see if we can avoid cleaning the 
wheel cache during our test
- consider how we want to number our dev/snapshot versions

Explicitly out of scope:
- Using pip installed PySpark to start a standalone cluster
- Using pip installed PySpark for non-Python Spark programs

*I've done some work to test release-build locally but as a non-committer I've 
just done local testing.
## How was this patch tested?

Automated testing with virtualenv, manual testing with conda, a system-wide install, and YARN integration.

release-build changes tested locally as a non-committer (no testing of upload 
artifacts to Apache staging websites)

Author: Holden Karau 
Author: Juliet Hougland 
Author: Juliet Hougland 

Closes #15659 from holdenk/SPARK-1267-pip-install-pyspark.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a36a76ac
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a36a76ac
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a36a76ac

Branch: refs/heads/master
Commit: a36a76ac43c36a3b897a748bd9f138b629dbc684
Parents: bb6cdfd
Author: Holden Karau 
Authored: Wed Nov 16 14:22:15 2016 -0800
Committer: Josh Rosen 
Committed: Wed Nov 16 14:22:15 2016 -0800

--
 .gitignore  |   2 +
 bin/beeline |   2 +-
 bin/find-spark-home |  41 
 bin/load-spark-env.sh   |   2 +-
 bin/pyspark |   6 +-
 bin/run-example |   2 +-
 bin/spark-class |   6 +-
 bin/spark-shell |   4 +-
 bin/spark-sql   |   2 +-
 bin/spark-submit|   2 +-
 bin/sparkR  |   2 +-
 dev/create-release/release-build.sh |  26 ++-
 dev/create-release/release-tag.sh   |  11 +-
 dev/lint-python |   4 +-
 dev/make-distribution.sh|  16 +-
 dev/pip-sanity-check.py |  36 
 dev/run-pip-tests   | 115 ++
 dev/run-tests-jenkins.py|   1 +
 dev/run-tests.py|   7 +
 dev/sparktestsupport/__init__.py|   1 +
 docs/building-spark.md  |   8 +
 docs/index.md   |   4 +-
 .../spark/launcher/CommandBuilderUtils.java |   2 +-
 python/MANIFEST.in  |  22 ++
 python/README.md|  32 +++
 python/pyspark/__init__.py  |   1 +
 python/pyspark/find_spark_home.py   |  74 +++
 python/pyspark/java_gateway.py  |   3 +-
 python/pyspark/version.py   |  19 ++
 python/setup.cfg|  22 ++
 python/setup.py | 209 +++
 31 files changed, 660 insertions(+), 24 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a36a76ac/.gitignore
--
diff --git a/.gitignore b/.gitignore
index 39d17e1..5634a43 100644
--- a/.gitignore
+++ b/.gitignore
@@ -57,6 +57,8 @@ project/plugins/project/build.properties
 project/plugins/src_managed/
 project/plugins/target/
 python/lib/pyspark.zip
+python/deps
+python/pyspark/python
 reports/
 scalastyle-on-compile.generated.xml
 scalastyle-output.xml

http://git-wip-us.apache.org/repos/asf/spark/blob/a36a76ac/bin/beeline
-

spark git commit: [SPARK-18186] Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support

2016-11-16 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/master a36a76ac4 -> 2ca8ae9aa


[SPARK-18186] Migrate HiveUDAFFunction to TypedImperativeAggregate for partial 
aggregation support

## What changes were proposed in this pull request?

While being evaluated in Spark SQL, Hive UDAFs don't support partial 
aggregation. This PR migrates `HiveUDAFFunction`s to 
`TypedImperativeAggregate`, which already provides partial aggregation support 
for aggregate functions that may use arbitrary Java objects as aggregation 
states.

The following snippet shows the effect of this PR:

```scala
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax
sql(s"CREATE FUNCTION hive_max AS '${classOf[GenericUDAFMax].getName}'")

spark.range(100).createOrReplaceTempView("t")

// A query using both Spark SQL native `max` and Hive `max`
sql(s"SELECT max(id), hive_max(id) FROM t").explain()
```

Before this PR:

```
== Physical Plan ==
SortAggregate(key=[], functions=[max(id#1L), default.hive_max(default.hive_max, 
HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax7475f57e),
 id#1L, false, 0, 0)])
+- Exchange SinglePartition
   +- *Range (0, 100, step=1, splits=Some(1))
```

After this PR:

```
== Physical Plan ==
SortAggregate(key=[], functions=[max(id#1L), default.hive_max(default.hive_max, 
HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax5e18a6a7),
 id#1L, false, 0, 0)])
+- Exchange SinglePartition
   +- SortAggregate(key=[], functions=[partial_max(id#1L), 
partial_default.hive_max(default.hive_max, 
HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax5e18a6a7),
 id#1L, false, 0, 0)])
  +- *Range (0, 100, step=1, splits=Some(1))
```

The tricky part of the PR is mostly about updating and passing around the aggregation states of `HiveUDAFFunction`s, since the aggregation state of a Hive UDAF may appear in three different forms. Let's take the test `MockUDAF` added in this PR as an example. This UDAF computes the count of non-null values together with the count of nulls of a given column. Its aggregation state may take the following forms at different times:

1. A `MockUDAFBuffer`, which is a concrete subclass of 
`GenericUDAFEvaluator.AggregationBuffer`

   The form used by Hive UDAF API. This form is required by the following 
scenarios:

   - Calling `GenericUDAFEvaluator.iterate()` to update an existing aggregation 
state with new input values.
   - Calling `GenericUDAFEvaluator.terminate()` to get the final aggregated 
value from an existing aggregation state.
   - Calling `GenericUDAFEvaluator.merge()` to merge other aggregation states 
into an existing aggregation state.

 The existing aggregation state to be updated must be in this form.

   Conversions:

   - To form 2:

 `GenericUDAFEvaluator.terminatePartial()`

   - To form 3:

 Convert to form 2 first, and then to 3.

2. An `Object[]` array containing two `java.lang.Long` values.

   The form used to interact with Hive's `ObjectInspector`s. This form is 
required by the following scenarios:

   - Calling `GenericUDAFEvaluator.terminatePartial()` to convert an existing 
aggregation state in form 1 to form 2.
   - Calling `GenericUDAFEvaluator.merge()` to merge other aggregation states 
into an existing aggregation state.

 The input aggregation state must be in this form.

   Conversions:

   - To form 1:

  No direct method; you have to create an empty `AggregationBuffer` and merge the form-2 value into that empty buffer.

   - To form 3:

 `unwrapperFor()`/`unwrap()` method of `HiveInspectors`

3. The byte array that holds data of an `UnsafeRow` with two `LongType` fields.

   The form used by Spark SQL to shuffle partial aggregation results. This form 
is required because `TypedImperativeAggregate` always asks its subclasses to 
serialize their aggregation states into a byte array.

   Conversions:

   - To form 1:

 Convert to form 2 first, and then to 1.

   - To form 2:

 `wrapperFor()`/`wrap()` method of `HiveInspectors`

Here are some micro-benchmark results produced with the most recent master and with this PR branch.

Master:

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz

hive udaf vs spark af:        Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
---------------------------------------------------------------------------------------
w/o groupBy                          339 /  372          3.1          323.2        1.0X
w/ groupBy                           503 /  529          2.1          479.7        0.7X
```

This PR:

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz

hive udaf vs spark af:   Best/Avg Time(ms)Rate

spark git commit: [YARN][DOC] Increasing NodeManager's heap size with External Shuffle Service

2016-11-16 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 2ca8ae9aa -> 55589987b


[YARN][DOC] Increasing NodeManager's heap size with External Shuffle Service

## What changes were proposed in this pull request?

Suggest that users increase the `NodeManager` heap size when the `External Shuffle Service` is enabled, since the `NM` can spend a lot of time doing GC, making shuffle operations a bottleneck (`Shuffle Read blocked time` goes up). Because of GC the `NodeManager` can also use an enormous amount of CPU, and cluster performance will suffer. I have seen a NodeManager using 5-13 GB of RAM and up to 2700% CPU with the `spark_shuffle` service on.

## How was this patch tested?

 Added step 5:
![shuffle_service](https://cloud.githubusercontent.com/assets/15244468/20355499/2fec0fde-ac2a-11e6-8f8b-1c80daf71be1.png)

Author: Artur Sukhenko 

Closes #15906 from Devian-ua/nmHeapSize.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/55589987
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/55589987
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/55589987

Branch: refs/heads/master
Commit: 55589987be89ff78dadf44498352fbbd811a206e
Parents: 2ca8ae9
Author: Artur Sukhenko 
Authored: Wed Nov 16 15:08:01 2016 -0800
Committer: Reynold Xin 
Committed: Wed Nov 16 15:08:01 2016 -0800

--
 docs/running-on-yarn.md | 2 ++
 1 file changed, 2 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/55589987/docs/running-on-yarn.md
--
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index cd18808..fe0221c 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -559,6 +559,8 @@ pre-packaged distribution.
 1. In the `yarn-site.xml` on each node, add `spark_shuffle` to 
`yarn.nodemanager.aux-services`,
 then set `yarn.nodemanager.aux-services.spark_shuffle.class` to
 `org.apache.spark.network.yarn.YarnShuffleService`.
+1. Increase `NodeManager's` heap size by setting `YARN_HEAPSIZE` (1000 by 
default) in `etc/hadoop/yarn-env.sh` 
+to avoid garbage collection issues during shuffle. 
 1. Restart all `NodeManager`s in your cluster.
 
 The following extra configuration options are available when the shuffle 
service is running on YARN:





spark git commit: [YARN][DOC] Increasing NodeManager's heap size with External Shuffle Service

2016-11-16 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 3d4756d56 -> 523abfe19


[YARN][DOC] Increasing NodeManager's heap size with External Shuffle Service

## What changes were proposed in this pull request?

Suggest that users increase the `NodeManager` heap size when the `External Shuffle Service` is enabled, since the `NM` can spend a lot of time doing GC, making shuffle operations a bottleneck (`Shuffle Read blocked time` goes up). Because of GC the `NodeManager` can also use an enormous amount of CPU, and cluster performance will suffer. I have seen a NodeManager using 5-13 GB of RAM and up to 2700% CPU with the `spark_shuffle` service on.

## How was this patch tested?

 Added step 5:
![shuffle_service](https://cloud.githubusercontent.com/assets/15244468/20355499/2fec0fde-ac2a-11e6-8f8b-1c80daf71be1.png)

Author: Artur Sukhenko 

Closes #15906 from Devian-ua/nmHeapSize.

(cherry picked from commit 55589987be89ff78dadf44498352fbbd811a206e)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/523abfe1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/523abfe1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/523abfe1

Branch: refs/heads/branch-2.1
Commit: 523abfe19caa11747133877b0c8319c68ac66e56
Parents: 3d4756d
Author: Artur Sukhenko 
Authored: Wed Nov 16 15:08:01 2016 -0800
Committer: Reynold Xin 
Committed: Wed Nov 16 15:08:10 2016 -0800

--
 docs/running-on-yarn.md | 2 ++
 1 file changed, 2 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/523abfe1/docs/running-on-yarn.md
--
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index cd18808..fe0221c 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -559,6 +559,8 @@ pre-packaged distribution.
 1. In the `yarn-site.xml` on each node, add `spark_shuffle` to 
`yarn.nodemanager.aux-services`,
 then set `yarn.nodemanager.aux-services.spark_shuffle.class` to
 `org.apache.spark.network.yarn.YarnShuffleService`.
+1. Increase `NodeManager's` heap size by setting `YARN_HEAPSIZE` (1000 by 
default) in `etc/hadoop/yarn-env.sh` 
+to avoid garbage collection issues during shuffle. 
 1. Restart all `NodeManager`s in your cluster.
 
 The following extra configuration options are available when the shuffle 
service is running on YARN:





spark git commit: [SPARK-18442][SQL] Fix nullability of WrapOption.

2016-11-16 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 55589987b -> 170eeb345


[SPARK-18442][SQL] Fix nullability of WrapOption.

## What changes were proposed in this pull request?

The nullability of `WrapOption` should be `false`.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN 

Closes #15887 from ueshin/issues/SPARK-18442.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/170eeb34
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/170eeb34
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/170eeb34

Branch: refs/heads/master
Commit: 170eeb345f951de89a39fe565697b3e913011768
Parents: 5558998
Author: Takuya UESHIN 
Authored: Thu Nov 17 11:21:08 2016 +0800
Committer: Wenchen Fan 
Committed: Thu Nov 17 11:21:08 2016 +0800

--
 .../apache/spark/sql/catalyst/expressions/objects/objects.scala| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/170eeb34/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
index 50e2ac3..0e3d991 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
@@ -341,7 +341,7 @@ case class WrapOption(child: Expression, optType: DataType)
 
   override def dataType: DataType = ObjectType(classOf[Option[_]])
 
-  override def nullable: Boolean = true
+  override def nullable: Boolean = false
 
   override def inputTypes: Seq[AbstractDataType] = optType :: Nil
 





spark git commit: [SPARK-18442][SQL] Fix nullability of WrapOption.

2016-11-16 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 523abfe19 -> 951579382


[SPARK-18442][SQL] Fix nullability of WrapOption.

## What changes were proposed in this pull request?

The nullability of `WrapOption` should be `false`.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN 

Closes #15887 from ueshin/issues/SPARK-18442.

(cherry picked from commit 170eeb345f951de89a39fe565697b3e913011768)
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/95157938
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/95157938
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/95157938

Branch: refs/heads/branch-2.1
Commit: 9515793820c7954d82116238a67e632ea3e783b5
Parents: 523abfe
Author: Takuya UESHIN 
Authored: Thu Nov 17 11:21:08 2016 +0800
Committer: Wenchen Fan 
Committed: Thu Nov 17 11:21:23 2016 +0800

--
 .../apache/spark/sql/catalyst/expressions/objects/objects.scala| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/95157938/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
index 50e2ac3..0e3d991 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
@@ -341,7 +341,7 @@ case class WrapOption(child: Expression, optType: DataType)
 
   override def dataType: DataType = ObjectType(classOf[Option[_]])
 
-  override def nullable: Boolean = true
+  override def nullable: Boolean = false
 
   override def inputTypes: Seq[AbstractDataType] = optType :: Nil
 





spark git commit: [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed

2016-11-16 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 951579382 -> 6a3cbbc03


[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed

## What changes were proposed in this pull request?

This PR aims to provide a pip-installable PySpark package. It does the work needed to copy the jars over and package them with the Python code, preventing the problems that come from mixing one version of the Python code with a different version of the JARs. It does not currently publish to PyPI, but that is the natural follow-up (SPARK-18129).

Done:
- pip installable on conda [manually tested]
- installed via setup.py on a non-pip-managed system (RHEL) with YARN [manually tested]
- Automated testing of this (virtualenv)
- packaging and signing with release-build*

Possible follow up work:
- release-build update to publish to PyPI (SPARK-18128)
- figure out who owns the pyspark package name on prod PyPI (is it someone within the project, should we ask PyPI, or should we choose a different name to publish under, like ApachePySpark?)
- Windows support and/or testing (SPARK-18136)
- investigate details of wheel caching and see if we can avoid cleaning the 
wheel cache during our test
- consider how we want to number our dev/snapshot versions

Explicitly out of scope:
- Using pip installed PySpark to start a standalone cluster
- Using pip installed PySpark for non-Python Spark programs

*I've done some work to test release-build locally but as a non-committer I've 
just done local testing.
## How was this patch tested?

Automated testing with virtualenv, manual testing with conda, a system-wide install, and YARN integration.

release-build changes tested locally as a non-committer (no testing of upload 
artifacts to Apache staging websites)

Author: Holden Karau 
Author: Juliet Hougland 
Author: Juliet Hougland 

Closes #15659 from holdenk/SPARK-1267-pip-install-pyspark.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6a3cbbc0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6a3cbbc0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6a3cbbc0

Branch: refs/heads/branch-2.1
Commit: 6a3cbbc037fe631e1b89c46000373dc2ba86a5eb
Parents: 9515793
Author: Holden Karau 
Authored: Wed Nov 16 14:22:15 2016 -0800
Committer: Josh Rosen 
Committed: Wed Nov 16 20:15:57 2016 -0800

--
 .gitignore  |   2 +
 bin/beeline |   2 +-
 bin/find-spark-home |  41 
 bin/load-spark-env.sh   |   2 +-
 bin/pyspark |   6 +-
 bin/run-example |   2 +-
 bin/spark-class |   6 +-
 bin/spark-shell |   4 +-
 bin/spark-sql   |   2 +-
 bin/spark-submit|   2 +-
 bin/sparkR  |   2 +-
 dev/create-release/release-build.sh |  26 ++-
 dev/create-release/release-tag.sh   |  11 +-
 dev/lint-python |   4 +-
 dev/make-distribution.sh|  16 +-
 dev/pip-sanity-check.py |  36 
 dev/run-pip-tests   | 115 ++
 dev/run-tests-jenkins.py|   1 +
 dev/run-tests.py|   7 +
 dev/sparktestsupport/__init__.py|   1 +
 docs/building-spark.md  |   8 +
 docs/index.md   |   4 +-
 .../spark/launcher/CommandBuilderUtils.java |   2 +-
 python/MANIFEST.in  |  22 ++
 python/README.md|  32 +++
 python/pyspark/__init__.py  |   1 +
 python/pyspark/find_spark_home.py   |  74 +++
 python/pyspark/java_gateway.py  |   3 +-
 python/pyspark/version.py   |  19 ++
 python/setup.cfg|  22 ++
 python/setup.py | 209 +++
 31 files changed, 660 insertions(+), 24 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6a3cbbc0/.gitignore
--
diff --git a/.gitignore b/.gitignore
index 39d17e1..5634a43 100644
--- a/.gitignore
+++ b/.gitignore
@@ -57,6 +57,8 @@ project/plugins/project/build.properties
 project/plugins/src_managed/
 project/plugins/target/
 python/lib/pyspark.zip
+python/deps
+python/pyspark/python
 reports/
 scalastyle-on-compile.generated.xml
 scalastyle-output.xml

http://git-wip-us.apache.org/repos/asf/spark/blob/6a3cbbc0/bin/beeline
-