spark git commit: [SPARK-20544][SPARKR] skip tests when running on CRAN

2017-05-03 Thread felixcheung
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 d8bd213f1 -> 5fe9313d7


[SPARK-20544][SPARKR] skip tests when running on CRAN

General rule on skip or not:
skip if
- RDD tests
- tests could run long or complicated (streaming, hivecontext)
- tests on error conditions
- tests won't likely change/break

unit tests, `R CMD check --as-cran`, `R CMD check`

Author: Felix Cheung 

Closes #17817 from felixcheung/rskiptest.

(cherry picked from commit fc472bddd1d9c6a28e57e31496c0166777af597e)
Signed-off-by: Felix Cheung 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5fe9313d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5fe9313d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5fe9313d

Branch: refs/heads/branch-2.2
Commit: 5fe9313d7c81679981000b8aea5ea4668a0a0bc8
Parents: d8bd213
Author: Felix Cheung 
Authored: Wed May 3 21:40:18 2017 -0700
Committer: Felix Cheung 
Committed: Wed May 3 21:51:33 2017 -0700

--
 R/pkg/inst/tests/testthat/test_Serde.R  |   6 ++
 R/pkg/inst/tests/testthat/test_Windows.R|   2 +
 R/pkg/inst/tests/testthat/test_binaryFile.R |   8 ++
 .../inst/tests/testthat/test_binary_function.R  |   6 ++
 R/pkg/inst/tests/testthat/test_broadcast.R  |   4 +
 R/pkg/inst/tests/testthat/test_client.R |   8 ++
 R/pkg/inst/tests/testthat/test_context.R|  16 +++
 R/pkg/inst/tests/testthat/test_includePackage.R |   4 +
 .../inst/tests/testthat/test_mllib_clustering.R |   4 +
 .../inst/tests/testthat/test_mllib_regression.R |  12 +++
 .../tests/testthat/test_parallelize_collect.R   |   8 ++
 R/pkg/inst/tests/testthat/test_rdd.R| 106 ++-
 R/pkg/inst/tests/testthat/test_shuffle.R|  24 +
 R/pkg/inst/tests/testthat/test_sparkR.R |   2 +
 R/pkg/inst/tests/testthat/test_sparkSQL.R   |  60 +++
 R/pkg/inst/tests/testthat/test_streaming.R  |  12 +++
 R/pkg/inst/tests/testthat/test_take.R   |   2 +
 R/pkg/inst/tests/testthat/test_textFile.R   |  18 
 R/pkg/inst/tests/testthat/test_utils.R  |   5 +
 R/run-tests.sh  |   2 +-
 20 files changed, 306 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5fe9313d/R/pkg/inst/tests/testthat/test_Serde.R
--
diff --git a/R/pkg/inst/tests/testthat/test_Serde.R 
b/R/pkg/inst/tests/testthat/test_Serde.R
index b5f6f1b..518fb7b 100644
--- a/R/pkg/inst/tests/testthat/test_Serde.R
+++ b/R/pkg/inst/tests/testthat/test_Serde.R
@@ -20,6 +20,8 @@ context("SerDe functionality")
 sparkSession <- sparkR.session(enableHiveSupport = FALSE)
 
 test_that("SerDe of primitive types", {
+  skip_on_cran()
+
   x <- callJStatic("SparkRHandler", "echo", 1L)
   expect_equal(x, 1L)
   expect_equal(class(x), "integer")
@@ -38,6 +40,8 @@ test_that("SerDe of primitive types", {
 })
 
 test_that("SerDe of list of primitive types", {
+  skip_on_cran()
+
   x <- list(1L, 2L, 3L)
   y <- callJStatic("SparkRHandler", "echo", x)
   expect_equal(x, y)
@@ -65,6 +69,8 @@ test_that("SerDe of list of primitive types", {
 })
 
 test_that("SerDe of list of lists", {
+  skip_on_cran()
+
   x <- list(list(1L, 2L, 3L), list(1, 2, 3),
 list(TRUE, FALSE), list("a", "b", "c"))
   y <- callJStatic("SparkRHandler", "echo", x)

http://git-wip-us.apache.org/repos/asf/spark/blob/5fe9313d/R/pkg/inst/tests/testthat/test_Windows.R
--
diff --git a/R/pkg/inst/tests/testthat/test_Windows.R 
b/R/pkg/inst/tests/testthat/test_Windows.R
index 1d777dd..919b063 100644
--- a/R/pkg/inst/tests/testthat/test_Windows.R
+++ b/R/pkg/inst/tests/testthat/test_Windows.R
@@ -17,6 +17,8 @@
 context("Windows-specific tests")
 
 test_that("sparkJars tag in SparkContext", {
+  skip_on_cran()
+
   if (.Platform$OS.type != "windows") {
 skip("This test is only for Windows, skipped")
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/5fe9313d/R/pkg/inst/tests/testthat/test_binaryFile.R
--
diff --git a/R/pkg/inst/tests/testthat/test_binaryFile.R 
b/R/pkg/inst/tests/testthat/test_binaryFile.R
index b5c279e..63f54e1 100644
--- a/R/pkg/inst/tests/testthat/test_binaryFile.R
+++ b/R/pkg/inst/tests/testthat/test_binaryFile.R
@@ -24,6 +24,8 @@ sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", 
"getJavaSparkContext",
 mockFile <- c("Spark is pretty.", "Spark is awesome.")
 
 test_that("saveAsObjectFile()/objectFile() following textFile() works", {
+  skip_on_cran()
+
   fileName1 <- tempfile(pattern = "spark-test", 

spark git commit: [SPARK-20543][SPARKR] skip tests when running on CRAN

2017-05-03 Thread felixcheung
Repository: spark
Updated Branches:
  refs/heads/master 02bbe7311 -> fc472bddd


[SPARK-20543][SPARKR] skip tests when running on CRAN

## What changes were proposed in this pull request?

General rule on skip or not:
skip if
- RDD tests
- tests could run long or complicated (streaming, hivecontext)
- tests on error conditions
- tests won't likely change/break

## How was this patch tested?

unit tests, `R CMD check --as-cran`, `R CMD check`

Author: Felix Cheung 

Closes #17817 from felixcheung/rskiptest.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fc472bdd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fc472bdd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fc472bdd

Branch: refs/heads/master
Commit: fc472bddd1d9c6a28e57e31496c0166777af597e
Parents: 02bbe73
Author: Felix Cheung 
Authored: Wed May 3 21:40:18 2017 -0700
Committer: Felix Cheung 
Committed: Wed May 3 21:40:18 2017 -0700

--
 R/pkg/inst/tests/testthat/test_Serde.R  |   6 ++
 R/pkg/inst/tests/testthat/test_Windows.R|   2 +
 R/pkg/inst/tests/testthat/test_binaryFile.R |   8 ++
 .../inst/tests/testthat/test_binary_function.R  |   6 ++
 R/pkg/inst/tests/testthat/test_broadcast.R  |   4 +
 R/pkg/inst/tests/testthat/test_client.R |   8 ++
 R/pkg/inst/tests/testthat/test_context.R|  16 +++
 R/pkg/inst/tests/testthat/test_includePackage.R |   4 +
 .../inst/tests/testthat/test_mllib_clustering.R |   4 +
 .../inst/tests/testthat/test_mllib_regression.R |  12 +++
 .../tests/testthat/test_parallelize_collect.R   |   8 ++
 R/pkg/inst/tests/testthat/test_rdd.R| 106 ++-
 R/pkg/inst/tests/testthat/test_shuffle.R|  24 +
 R/pkg/inst/tests/testthat/test_sparkR.R |   2 +
 R/pkg/inst/tests/testthat/test_sparkSQL.R   |  61 ++-
 R/pkg/inst/tests/testthat/test_streaming.R  |  12 +++
 R/pkg/inst/tests/testthat/test_take.R   |   2 +
 R/pkg/inst/tests/testthat/test_textFile.R   |  18 
 R/pkg/inst/tests/testthat/test_utils.R  |   6 ++
 R/run-tests.sh  |   2 +-
 20 files changed, 307 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fc472bdd/R/pkg/inst/tests/testthat/test_Serde.R
--
diff --git a/R/pkg/inst/tests/testthat/test_Serde.R 
b/R/pkg/inst/tests/testthat/test_Serde.R
index b5f6f1b..518fb7b 100644
--- a/R/pkg/inst/tests/testthat/test_Serde.R
+++ b/R/pkg/inst/tests/testthat/test_Serde.R
@@ -20,6 +20,8 @@ context("SerDe functionality")
 sparkSession <- sparkR.session(enableHiveSupport = FALSE)
 
 test_that("SerDe of primitive types", {
+  skip_on_cran()
+
   x <- callJStatic("SparkRHandler", "echo", 1L)
   expect_equal(x, 1L)
   expect_equal(class(x), "integer")
@@ -38,6 +40,8 @@ test_that("SerDe of primitive types", {
 })
 
 test_that("SerDe of list of primitive types", {
+  skip_on_cran()
+
   x <- list(1L, 2L, 3L)
   y <- callJStatic("SparkRHandler", "echo", x)
   expect_equal(x, y)
@@ -65,6 +69,8 @@ test_that("SerDe of list of primitive types", {
 })
 
 test_that("SerDe of list of lists", {
+  skip_on_cran()
+
   x <- list(list(1L, 2L, 3L), list(1, 2, 3),
 list(TRUE, FALSE), list("a", "b", "c"))
   y <- callJStatic("SparkRHandler", "echo", x)

http://git-wip-us.apache.org/repos/asf/spark/blob/fc472bdd/R/pkg/inst/tests/testthat/test_Windows.R
--
diff --git a/R/pkg/inst/tests/testthat/test_Windows.R 
b/R/pkg/inst/tests/testthat/test_Windows.R
index 1d777dd..919b063 100644
--- a/R/pkg/inst/tests/testthat/test_Windows.R
+++ b/R/pkg/inst/tests/testthat/test_Windows.R
@@ -17,6 +17,8 @@
 context("Windows-specific tests")
 
 test_that("sparkJars tag in SparkContext", {
+  skip_on_cran()
+
   if (.Platform$OS.type != "windows") {
 skip("This test is only for Windows, skipped")
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/fc472bdd/R/pkg/inst/tests/testthat/test_binaryFile.R
--
diff --git a/R/pkg/inst/tests/testthat/test_binaryFile.R 
b/R/pkg/inst/tests/testthat/test_binaryFile.R
index b5c279e..63f54e1 100644
--- a/R/pkg/inst/tests/testthat/test_binaryFile.R
+++ b/R/pkg/inst/tests/testthat/test_binaryFile.R
@@ -24,6 +24,8 @@ sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", 
"getJavaSparkContext",
 mockFile <- c("Spark is pretty.", "Spark is awesome.")
 
 test_that("saveAsObjectFile()/objectFile() following textFile() works", {
+  skip_on_cran()
+
   fileName1 <- tempfile(pattern = "spark-test", fileext = ".tmp")
   fileName2 <- 

spark git commit: [SPARK-20584][PYSPARK][SQL] Python generic hint support

2017-05-03 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 13eb37c86 -> 02bbe7311


[SPARK-20584][PYSPARK][SQL] Python generic hint support

## What changes were proposed in this pull request?

Adds `hint` method to PySpark `DataFrame`.
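
For context, a minimal usage sketch of the new method (an illustration rather than part of the patch; it assumes an active `SparkSession` named `spark`, and the data and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("Alice", 2), ("Bob", 5)], ["name", "age"])
df2 = spark.createDataFrame([("Alice", 80), ("Bob", 85)], ["name", "height"])

# Ask the planner to broadcast df2 in the join; any extra string
# parameters are forwarded to the JVM-side hint resolver.
joined = df.join(df2.hint("broadcast"), "name")
joined.explain()  # the physical plan should contain a BroadcastHashJoin
joined.show()
```

The new test in `tests.py` checks exactly this behavior: the executed plan of such a join contains one `BroadcastHashJoin`.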

## How was this patch tested?

Unit tests, doctests.

Author: zero323 

Closes #17850 from zero323/SPARK-20584.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/02bbe731
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/02bbe731
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/02bbe731

Branch: refs/heads/master
Commit: 02bbe73118a39e2fb378aa2002449367a92f6d67
Parents: 13eb37c
Author: zero323 
Authored: Wed May 3 19:15:28 2017 -0700
Committer: Reynold Xin 
Committed: Wed May 3 19:15:28 2017 -0700

--
 python/pyspark/sql/dataframe.py | 29 +
 python/pyspark/sql/tests.py | 16 
 2 files changed, 45 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/02bbe731/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index ab6d35b..7b67985 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -380,6 +380,35 @@ class DataFrame(object):
         jdf = self._jdf.withWatermark(eventTime, delayThreshold)
         return DataFrame(jdf, self.sql_ctx)
 
+    @since(2.2)
+    def hint(self, name, *parameters):
+        """Specifies some hint on the current DataFrame.
+
+        :param name: A name of the hint.
+        :param parameters: Optional parameters.
+        :return: :class:`DataFrame`
+
+        >>> df.join(df2.hint("broadcast"), "name").show()
+        +----+---+------+
+        |name|age|height|
+        +----+---+------+
+        | Bob|  5|    85|
+        +----+---+------+
+        """
+        if len(parameters) == 1 and isinstance(parameters[0], list):
+            parameters = parameters[0]
+
+        if not isinstance(name, str):
+            raise TypeError("name should be provided as str, got {0}".format(type(name)))
+
+        for p in parameters:
+            if not isinstance(p, str):
+                raise TypeError(
+                    "all parameters should be str, got {0} of type {1}".format(p, type(p)))
+
+        jdf = self._jdf.hint(name, self._jseq(parameters))
+        return DataFrame(jdf, self.sql_ctx)
+
     @since(1.3)
     def count(self):
         """Returns the number of rows in this :class:`DataFrame`.

http://git-wip-us.apache.org/repos/asf/spark/blob/02bbe731/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index ce4abf8..f644624 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -1906,6 +1906,22 @@ class SQLTests(ReusedPySparkTestCase):
         # planner should not crash without a join
         broadcast(df1)._jdf.queryExecution().executedPlan()
 
+    def test_generic_hints(self):
+        from pyspark.sql import DataFrame
+
+        df1 = self.spark.range(10e10).toDF("id")
+        df2 = self.spark.range(10e10).toDF("id")
+
+        self.assertIsInstance(df1.hint("broadcast"), DataFrame)
+        self.assertIsInstance(df1.hint("broadcast", []), DataFrame)
+
+        # Dummy rules
+        self.assertIsInstance(df1.hint("broadcast", "foo", "bar"), DataFrame)
+        self.assertIsInstance(df1.hint("broadcast", ["foo", "bar"]), DataFrame)
+
+        plan = df1.join(df2.hint("broadcast"), "id")._jdf.queryExecution().executedPlan()
+        self.assertEqual(1, plan.toString().count("BroadcastHashJoin"))
+
     def test_toDF_with_schema_string(self):
         data = [Row(key=i, value=str(i)) for i in range(100)]
         rdd = self.sc.parallelize(data, 5)





spark git commit: [SPARK-20584][PYSPARK][SQL] Python generic hint support

2017-05-03 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 a3a5fcfef -> d8bd213f1


[SPARK-20584][PYSPARK][SQL] Python generic hint support

## What changes were proposed in this pull request?

Adds `hint` method to PySpark `DataFrame`.

## How was this patch tested?

Unit tests, doctests.

Author: zero323 

Closes #17850 from zero323/SPARK-20584.

(cherry picked from commit 02bbe73118a39e2fb378aa2002449367a92f6d67)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d8bd213f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d8bd213f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d8bd213f

Branch: refs/heads/branch-2.2
Commit: d8bd213f13279664d50ffa57c1814d0b16fc5d23
Parents: a3a5fcf
Author: zero323 
Authored: Wed May 3 19:15:28 2017 -0700
Committer: Reynold Xin 
Committed: Wed May 3 19:15:42 2017 -0700

--
 python/pyspark/sql/dataframe.py | 29 +
 python/pyspark/sql/tests.py | 16 
 2 files changed, 45 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d8bd213f/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index f567cc4..d62ba96 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -371,6 +371,35 @@ class DataFrame(object):
         jdf = self._jdf.withWatermark(eventTime, delayThreshold)
         return DataFrame(jdf, self.sql_ctx)
 
+    @since(2.2)
+    def hint(self, name, *parameters):
+        """Specifies some hint on the current DataFrame.
+
+        :param name: A name of the hint.
+        :param parameters: Optional parameters.
+        :return: :class:`DataFrame`
+
+        >>> df.join(df2.hint("broadcast"), "name").show()
+        +----+---+------+
+        |name|age|height|
+        +----+---+------+
+        | Bob|  5|    85|
+        +----+---+------+
+        """
+        if len(parameters) == 1 and isinstance(parameters[0], list):
+            parameters = parameters[0]
+
+        if not isinstance(name, str):
+            raise TypeError("name should be provided as str, got {0}".format(type(name)))
+
+        for p in parameters:
+            if not isinstance(p, str):
+                raise TypeError(
+                    "all parameters should be str, got {0} of type {1}".format(p, type(p)))
+
+        jdf = self._jdf.hint(name, self._jseq(parameters))
+        return DataFrame(jdf, self.sql_ctx)
+
     @since(1.3)
     def count(self):
         """Returns the number of rows in this :class:`DataFrame`.

http://git-wip-us.apache.org/repos/asf/spark/blob/d8bd213f/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index cd92148..2aa2d23 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -1906,6 +1906,22 @@ class SQLTests(ReusedPySparkTestCase):
         # planner should not crash without a join
         broadcast(df1)._jdf.queryExecution().executedPlan()
 
+    def test_generic_hints(self):
+        from pyspark.sql import DataFrame
+
+        df1 = self.spark.range(10e10).toDF("id")
+        df2 = self.spark.range(10e10).toDF("id")
+
+        self.assertIsInstance(df1.hint("broadcast"), DataFrame)
+        self.assertIsInstance(df1.hint("broadcast", []), DataFrame)
+
+        # Dummy rules
+        self.assertIsInstance(df1.hint("broadcast", "foo", "bar"), DataFrame)
+        self.assertIsInstance(df1.hint("broadcast", ["foo", "bar"]), DataFrame)
+
+        plan = df1.join(df2.hint("broadcast"), "id")._jdf.queryExecution().executedPlan()
+        self.assertEqual(1, plan.toString().count("BroadcastHashJoin"))
+
     def test_toDF_with_schema_string(self):
         data = [Row(key=i, value=str(i)) for i in range(100)]
         rdd = self.sc.parallelize(data, 5)





[2/2] spark git commit: Preparing development version 2.2.1-SNAPSHOT

2017-05-03 Thread pwendell
Preparing development version 2.2.1-SNAPSHOT


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a3a5fcfe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a3a5fcfe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a3a5fcfe

Branch: refs/heads/branch-2.2
Commit: a3a5fcfefcc25e03496d097b63cd268f61d24c09
Parents: 1d4017b
Author: Patrick Wendell 
Authored: Wed May 3 16:50:12 2017 -0700
Committer: Patrick Wendell 
Committed: Wed May 3 16:50:12 2017 -0700

--
 R/pkg/DESCRIPTION | 2 +-
 assembly/pom.xml  | 2 +-
 common/network-common/pom.xml | 2 +-
 common/network-shuffle/pom.xml| 2 +-
 common/network-yarn/pom.xml   | 2 +-
 common/sketch/pom.xml | 2 +-
 common/tags/pom.xml   | 2 +-
 common/unsafe/pom.xml | 2 +-
 core/pom.xml  | 2 +-
 docs/_config.yml  | 4 ++--
 examples/pom.xml  | 2 +-
 external/docker-integration-tests/pom.xml | 2 +-
 external/flume-assembly/pom.xml   | 2 +-
 external/flume-sink/pom.xml   | 2 +-
 external/flume/pom.xml| 2 +-
 external/kafka-0-10-assembly/pom.xml  | 2 +-
 external/kafka-0-10-sql/pom.xml   | 2 +-
 external/kafka-0-10/pom.xml   | 2 +-
 external/kafka-0-8-assembly/pom.xml   | 2 +-
 external/kafka-0-8/pom.xml| 2 +-
 external/kinesis-asl-assembly/pom.xml | 2 +-
 external/kinesis-asl/pom.xml  | 2 +-
 external/spark-ganglia-lgpl/pom.xml   | 2 +-
 graphx/pom.xml| 2 +-
 launcher/pom.xml  | 2 +-
 mllib-local/pom.xml   | 2 +-
 mllib/pom.xml | 2 +-
 pom.xml   | 2 +-
 python/pyspark/version.py | 2 +-
 repl/pom.xml  | 2 +-
 resource-managers/mesos/pom.xml   | 2 +-
 resource-managers/yarn/pom.xml| 2 +-
 sql/catalyst/pom.xml  | 2 +-
 sql/core/pom.xml  | 2 +-
 sql/hive-thriftserver/pom.xml | 2 +-
 sql/hive/pom.xml  | 2 +-
 streaming/pom.xml | 2 +-
 tools/pom.xml | 2 +-
 38 files changed, 39 insertions(+), 39 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a3a5fcfe/R/pkg/DESCRIPTION
--
diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index 879c1f8..cfa49b9 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: SparkR
 Type: Package
-Version: 2.2.0
+Version: 2.2.1
 Title: R Frontend for Apache Spark
 Description: The SparkR package provides an R Frontend for Apache Spark.
 Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),

http://git-wip-us.apache.org/repos/asf/spark/blob/a3a5fcfe/assembly/pom.xml
--
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 3a7003f..da7b0c9 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.11</artifactId>
-    <version>2.2.0</version>
+    <version>2.2.1-SNAPSHOT</version>
    <relativePath>../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/a3a5fcfe/common/network-common/pom.xml
--
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index 5e9ffd1..7577253 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.11</artifactId>
-    <version>2.2.0</version>
+    <version>2.2.1-SNAPSHOT</version>
    <relativePath>../../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/a3a5fcfe/common/network-shuffle/pom.xml
--
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index c3e10d1..558864a 100644
--- a/common/network-shuffle/pom.xml
+++ b/common/network-shuffle/pom.xml
@@ -22,7 +22,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.11</artifactId>
-    <version>2.2.0</version>
+    <version>2.2.1-SNAPSHOT</version>
    <relativePath>../../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/a3a5fcfe/common/network-yarn/pom.xml
--
diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml
index 10ea657..70fed65 100644
--- a/common/network-yarn/pom.xml
+++ b/common/network-yarn/pom.xml
@@ -22,7 +22,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.11</artifactId>
-    <version>2.2.0</version>
+    <version>2.2.1-SNAPSHOT</version>
    <relativePath>../../pom.xml</relativePath>
   </parent>
 


[1/2] spark git commit: Preparing Spark release v2.2.0-rc2

2017-05-03 Thread pwendell
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 2629e7c7a -> a3a5fcfef


Preparing Spark release v2.2.0-rc2


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1d4017b4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1d4017b4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1d4017b4

Branch: refs/heads/branch-2.2
Commit: 1d4017b44d5e6ad156abeaae6371747f111dd1f9
Parents: 2629e7c
Author: Patrick Wendell 
Authored: Wed May 3 16:50:08 2017 -0700
Committer: Patrick Wendell 
Committed: Wed May 3 16:50:08 2017 -0700

--
 assembly/pom.xml  | 2 +-
 common/network-common/pom.xml | 2 +-
 common/network-shuffle/pom.xml| 2 +-
 common/network-yarn/pom.xml   | 2 +-
 common/sketch/pom.xml | 2 +-
 common/tags/pom.xml   | 2 +-
 common/unsafe/pom.xml | 2 +-
 core/pom.xml  | 2 +-
 docs/_config.yml  | 2 +-
 examples/pom.xml  | 2 +-
 external/docker-integration-tests/pom.xml | 2 +-
 external/flume-assembly/pom.xml   | 2 +-
 external/flume-sink/pom.xml   | 2 +-
 external/flume/pom.xml| 2 +-
 external/kafka-0-10-assembly/pom.xml  | 2 +-
 external/kafka-0-10-sql/pom.xml   | 2 +-
 external/kafka-0-10/pom.xml   | 2 +-
 external/kafka-0-8-assembly/pom.xml   | 2 +-
 external/kafka-0-8/pom.xml| 2 +-
 external/kinesis-asl-assembly/pom.xml | 2 +-
 external/kinesis-asl/pom.xml  | 2 +-
 external/spark-ganglia-lgpl/pom.xml   | 2 +-
 graphx/pom.xml| 2 +-
 launcher/pom.xml  | 2 +-
 mllib-local/pom.xml   | 2 +-
 mllib/pom.xml | 2 +-
 pom.xml   | 2 +-
 python/pyspark/version.py | 2 +-
 repl/pom.xml  | 2 +-
 resource-managers/mesos/pom.xml   | 2 +-
 resource-managers/yarn/pom.xml| 2 +-
 sql/catalyst/pom.xml  | 2 +-
 sql/core/pom.xml  | 2 +-
 sql/hive-thriftserver/pom.xml | 2 +-
 sql/hive/pom.xml  | 2 +-
 streaming/pom.xml | 2 +-
 tools/pom.xml | 2 +-
 37 files changed, 37 insertions(+), 37 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1d4017b4/assembly/pom.xml
--
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 9d8607d..3a7003f 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.11</artifactId>
-    <version>2.2.0-SNAPSHOT</version>
+    <version>2.2.0</version>
    <relativePath>../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/1d4017b4/common/network-common/pom.xml
--
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index 8657af7..5e9ffd1 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.11</artifactId>
-    <version>2.2.0-SNAPSHOT</version>
+    <version>2.2.0</version>
    <relativePath>../../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/1d4017b4/common/network-shuffle/pom.xml
--
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index 24c10fb..c3e10d1 100644
--- a/common/network-shuffle/pom.xml
+++ b/common/network-shuffle/pom.xml
@@ -22,7 +22,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.11</artifactId>
-    <version>2.2.0-SNAPSHOT</version>
+    <version>2.2.0</version>
    <relativePath>../../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/1d4017b4/common/network-yarn/pom.xml
--
diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml
index 5e5a80b..10ea657 100644
--- a/common/network-yarn/pom.xml
+++ b/common/network-yarn/pom.xml
@@ -22,7 +22,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.11</artifactId>
-    <version>2.2.0-SNAPSHOT</version>
+    <version>2.2.0</version>
    <relativePath>../../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/1d4017b4/common/sketch/pom.xml
--
diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml
index 1356c47..1a1f652 100644
--- a/common/sketch/pom.xml
+++ b/common/sketch/pom.xml
@@ -22,7 +22,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.11</artifactId>
-    <version>2.2.0-SNAPSHOT</version>
+    <version>2.2.0</version>
    <relativePath>../../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/1d4017b4/common/tags/pom.xml

spark git commit: [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!=

2017-05-03 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 6b9e49d12 -> 13eb37c86


[MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and 
add a test for =!=

## What changes were proposed in this pull request?

This PR proposes three things as below:

- This test does not appear to test `<=>` and is identical to the `===` test above, so it is removed.

  ```diff
  -   test("<=>") {
  - checkAnswer(
  -  testData2.filter($"a" === 1),
  -  testData2.collect().toSeq.filter(r => r.getInt(0) == 1))
  -
  -checkAnswer(
  -  testData2.filter($"a" === $"b"),
  -  testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1)))
  -   }
  ```

- Rename the test title from `=!=` to `<=>`, since the test actually exercises `<=>`.

  ```diff
  +  private lazy val nullData = Seq(
  +(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, 
None)).toDF("a", "b")
  +
...
  -  test("=!=") {
  +  test("<=>") {
  -val nullData = spark.createDataFrame(sparkContext.parallelize(
  -  Row(1, 1) ::
  -  Row(1, 2) ::
  -  Row(1, null) ::
  -  Row(null, null) :: Nil),
  -  StructType(Seq(StructField("a", IntegerType), StructField("b", 
IntegerType
  -
 checkAnswer(
   nullData.filter($"b" <=> 1),
...
  ```

- Add tests for `=!=`, which do not currently exist.

  ```diff
  +  test("=!=") {
  +checkAnswer(
  +  nullData.filter($"b" =!= 1),
  +  Row(1, 2) :: Nil)
  +
  +checkAnswer(nullData.filter($"b" =!= null), Nil)
  +
  +checkAnswer(
  +  nullData.filter($"a" =!= $"b"),
  +  Row(1, 2) :: Nil)
  +  }
  ```

## How was this patch tested?

Manually running the tests.

Author: hyukjinkwon 

Closes #17842 from HyukjinKwon/minor-test-fix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/13eb37c8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/13eb37c8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/13eb37c8

Branch: refs/heads/master
Commit: 13eb37c860c8f672d0e9d9065d0333f981db71e3
Parents: 6b9e49d
Author: hyukjinkwon 
Authored: Wed May 3 13:08:25 2017 -0700
Committer: Reynold Xin 
Committed: Wed May 3 13:08:25 2017 -0700

--
 .../spark/sql/ColumnExpressionSuite.scala   | 31 +---
 1 file changed, 14 insertions(+), 17 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/13eb37c8/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
index b0f398d..bc708ca 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
@@ -39,6 +39,9 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
   StructType(Seq(StructField("a", BooleanType), StructField("b", 
BooleanType
   }
 
+  private lazy val nullData = Seq(
+(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, 
None)).toDF("a", "b")
+
   test("column names with space") {
 val df = Seq((1, "a")).toDF("name with space", "name.with.dot")
 
@@ -284,23 +287,6 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
 
   test("<=>") {
 checkAnswer(
-  testData2.filter($"a" === 1),
-  testData2.collect().toSeq.filter(r => r.getInt(0) == 1))
-
-checkAnswer(
-  testData2.filter($"a" === $"b"),
-  testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1)))
-  }
-
-  test("=!=") {
-val nullData = spark.createDataFrame(sparkContext.parallelize(
-  Row(1, 1) ::
-  Row(1, 2) ::
-  Row(1, null) ::
-  Row(null, null) :: Nil),
-  StructType(Seq(StructField("a", IntegerType), StructField("b", 
IntegerType
-
-checkAnswer(
   nullData.filter($"b" <=> 1),
   Row(1, 1) :: Nil)
 
@@ -321,7 +307,18 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
 checkAnswer(
   nullData2.filter($"a" <=> null),
   Row(null) :: Nil)
+  }
 
+  test("=!=") {
+checkAnswer(
+  nullData.filter($"b" =!= 1),
+  Row(1, 2) :: Nil)
+
+checkAnswer(nullData.filter($"b" =!= null), Nil)
+
+checkAnswer(
+  nullData.filter($"a" =!= $"b"),
+  Row(1, 2) :: Nil)
   }
 
   test(">") {





spark git commit: [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!=

2017-05-03 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 36d807906 -> 2629e7c7a


[MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and 
add a test for =!=

## What changes were proposed in this pull request?

This PR proposes three things as below:

- This test does not appear to test `<=>` and is identical to the `===` test above, so it is removed.

  ```diff
  -   test("<=>") {
  - checkAnswer(
  -  testData2.filter($"a" === 1),
  -  testData2.collect().toSeq.filter(r => r.getInt(0) == 1))
  -
  -checkAnswer(
  -  testData2.filter($"a" === $"b"),
  -  testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1)))
  -   }
  ```

- Rename the test title from `=!=` to `<=>`, since the test actually exercises `<=>`.

  ```diff
  +  private lazy val nullData = Seq(
  +(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, 
None)).toDF("a", "b")
  +
...
  -  test("=!=") {
  +  test("<=>") {
  -val nullData = spark.createDataFrame(sparkContext.parallelize(
  -  Row(1, 1) ::
  -  Row(1, 2) ::
  -  Row(1, null) ::
  -  Row(null, null) :: Nil),
  -  StructType(Seq(StructField("a", IntegerType), StructField("b", 
IntegerType
  -
 checkAnswer(
   nullData.filter($"b" <=> 1),
...
  ```

- Add tests for `=!=`, which do not currently exist.

  ```diff
  +  test("=!=") {
  +checkAnswer(
  +  nullData.filter($"b" =!= 1),
  +  Row(1, 2) :: Nil)
  +
  +checkAnswer(nullData.filter($"b" =!= null), Nil)
  +
  +checkAnswer(
  +  nullData.filter($"a" =!= $"b"),
  +  Row(1, 2) :: Nil)
  +  }
  ```

## How was this patch tested?

Manually running the tests.

Author: hyukjinkwon 

Closes #17842 from HyukjinKwon/minor-test-fix.

(cherry picked from commit 13eb37c860c8f672d0e9d9065d0333f981db71e3)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2629e7c7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2629e7c7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2629e7c7

Branch: refs/heads/branch-2.2
Commit: 2629e7c7a1dacfb267d866cf825fa8a078612462
Parents: 36d8079
Author: hyukjinkwon 
Authored: Wed May 3 13:08:25 2017 -0700
Committer: Reynold Xin 
Committed: Wed May 3 13:08:31 2017 -0700

--
 .../spark/sql/ColumnExpressionSuite.scala   | 31 +---
 1 file changed, 14 insertions(+), 17 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2629e7c7/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
index b0f398d..bc708ca 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
@@ -39,6 +39,9 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
   StructType(Seq(StructField("a", BooleanType), StructField("b", 
BooleanType
   }
 
+  private lazy val nullData = Seq(
+(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, 
None)).toDF("a", "b")
+
   test("column names with space") {
 val df = Seq((1, "a")).toDF("name with space", "name.with.dot")
 
@@ -284,23 +287,6 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
 
   test("<=>") {
 checkAnswer(
-  testData2.filter($"a" === 1),
-  testData2.collect().toSeq.filter(r => r.getInt(0) == 1))
-
-checkAnswer(
-  testData2.filter($"a" === $"b"),
-  testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1)))
-  }
-
-  test("=!=") {
-val nullData = spark.createDataFrame(sparkContext.parallelize(
-  Row(1, 1) ::
-  Row(1, 2) ::
-  Row(1, null) ::
-  Row(null, null) :: Nil),
-  StructType(Seq(StructField("a", IntegerType), StructField("b", 
IntegerType
-
-checkAnswer(
   nullData.filter($"b" <=> 1),
   Row(1, 1) :: Nil)
 
@@ -321,7 +307,18 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
 checkAnswer(
   nullData2.filter($"a" <=> null),
   Row(null) :: Nil)
+  }
 
+  test("=!=") {
+checkAnswer(
+  nullData.filter($"b" =!= 1),
+  Row(1, 2) :: Nil)
+
+checkAnswer(nullData.filter($"b" =!= null), Nil)
+
+checkAnswer(
+  nullData.filter($"a" =!= $"b"),
+  Row(1, 2) :: Nil)
   }
 
   test(">") {



spark git commit: [SPARK-19965][SS] DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output

2017-05-03 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 527fc5d0c -> 6b9e49d12


[SPARK-19965][SS] DataFrame batch reader may fail to infer partitions when 
reading FileStreamSink's output

## The Problem

Right now DataFrame batch reader may fail to infer partitions when reading 
FileStreamSink's output:

```
[info] - partitioned writing and batch reading with 'basePath' *** FAILED *** 
(3 seconds, 928 milliseconds)
[info]   java.lang.AssertionError: assertion failed: Conflicting directory 
structures detected. Suspicious paths:
[info]  ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637
[info]  ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata
[info]
[info] If provided paths are partition directories, please set "basePath" in 
the options of the data source to specify the root directory of the table. If 
there are multiple root directories, please load them separately and then union 
them.
[info]   at scala.Predef$.assert(Predef.scala:170)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156)
[info]   at 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55)
[info]   at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133)
[info]   at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
[info]   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160)
[info]   at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536)
[info]   at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520)
[info]   at 
org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292)
[info]   at 
org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
[info]   at 
org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
```

## What changes were proposed in this pull request?

This patch alters `InMemoryFileIndex` to filter out these `basePath`s whose 
ancestor is the streaming metadata dir (`_spark_metadata`). E.g., the following 
and other similar dir or files will be filtered out:
- (introduced by globbing `basePath/*`)
   - `basePath/_spark_metadata`
- (introduced by globbing `basePath/*/*`)
   - `basePath/_spark_metadata/0`
   - `basePath/_spark_metadata/1`
   - ...
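
Conceptually, the filter drops any candidate path that is the metadata directory or lies beneath it. A rough Python sketch of that check (illustrative only; the actual change lives in the Scala `InMemoryFileIndex`/`FileStreamSink` code, and the helper name here is made up):

```python
from pathlib import PurePosixPath

SPARK_METADATA_DIR = "_spark_metadata"  # FileStreamSink's metadata dir name

def under_metadata_dir(path):
    """True if the path itself or any of its ancestors is the metadata dir."""
    return SPARK_METADATA_DIR in PurePosixPath(path).parts

# Paths picked up by globbing basePath/* and basePath/*/*:
candidates = [
    "stream.output/part=0/data.parquet",
    "stream.output/_spark_metadata",
    "stream.output/_spark_metadata/0",
]
data_paths = [p for p in candidates if not under_metadata_dir(p)]
print(data_paths)  # ['stream.output/part=0/data.parquet']
```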

## How was this patch tested?

Added unit tests

Author: Liwei Lin 

Closes #17346 from lw-lin/filter-metadata.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6b9e49d1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6b9e49d1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6b9e49d1

Branch: refs/heads/master
Commit: 6b9e49d12fc4c9b29d497122daa4cc9bf4540b16
Parents: 527fc5d
Author: Liwei Lin 
Authored: Wed May 3 11:10:24 2017 -0700
Committer: Shixiong Zhu 
Committed: Wed May 3 11:10:24 2017 -0700

--
 .../datasources/InMemoryFileIndex.scala | 13 -
 .../execution/streaming/FileStreamSink.scala| 20 +++
 .../datasources/FileSourceStrategySuite.scala   |  2 +-
 .../sql/streaming/FileStreamSinkSuite.scala | 59 +++-
 4 files changed, 90 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6b9e49d1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
index 9897ab7..91e3165 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
@@ -27,6 +27,7 @@ import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
 
 import org.apache.spark.internal.Logging
 import org.apache.spark.metrics.source.HiveCatalogMetrics
+import org.apache.spark.sql.execution.streaming.FileStreamSink
 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.types.StructType
 import 

spark git commit: [SPARK-19965][SS] DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output

2017-05-03 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 f0e80aa2d -> 36d807906


[SPARK-19965][SS] DataFrame batch reader may fail to infer partitions when 
reading FileStreamSink's output

## The Problem

Right now DataFrame batch reader may fail to infer partitions when reading 
FileStreamSink's output:

```
[info] - partitioned writing and batch reading with 'basePath' *** FAILED *** 
(3 seconds, 928 milliseconds)
[info]   java.lang.AssertionError: assertion failed: Conflicting directory 
structures detected. Suspicious paths:
[info]  ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637
[info]  ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata
[info]
[info] If provided paths are partition directories, please set "basePath" in 
the options of the data source to specify the root directory of the table. If 
there are multiple root directories, please load them separately and then union 
them.
[info]   at scala.Predef$.assert(Predef.scala:170)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156)
[info]   at 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55)
[info]   at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133)
[info]   at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
[info]   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160)
[info]   at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536)
[info]   at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520)
[info]   at 
org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292)
[info]   at 
org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
[info]   at 
org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
```

## What changes were proposed in this pull request?

This patch alters `InMemoryFileIndex` to filter out these `basePath`s whose 
ancestor is the streaming metadata dir (`_spark_metadata`). E.g., the following 
and other similar dir or files will be filtered out:
- (introduced by globbing `basePath/*`)
   - `basePath/_spark_metadata`
- (introduced by globbing `basePath/*/*`)
   - `basePath/_spark_metadata/0`
   - `basePath/_spark_metadata/1`
   - ...

## How was this patch tested?

Added unit tests

Author: Liwei Lin 

Closes #17346 from lw-lin/filter-metadata.

(cherry picked from commit 6b9e49d12fc4c9b29d497122daa4cc9bf4540b16)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/36d80790
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/36d80790
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/36d80790

Branch: refs/heads/branch-2.2
Commit: 36d80790699c529b15e9c1a2cf2f9f636b1f24e6
Parents: f0e80aa
Author: Liwei Lin 
Authored: Wed May 3 11:10:24 2017 -0700
Committer: Shixiong Zhu 
Committed: Wed May 3 11:10:31 2017 -0700

--
 .../datasources/InMemoryFileIndex.scala | 13 -
 .../execution/streaming/FileStreamSink.scala| 20 +++
 .../datasources/FileSourceStrategySuite.scala   |  2 +-
 .../sql/streaming/FileStreamSinkSuite.scala | 59 +++-
 4 files changed, 90 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/36d80790/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
index 9897ab7..91e3165 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
@@ -27,6 +27,7 @@ import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
 
 import org.apache.spark.internal.Logging
 import org.apache.spark.metrics.source.HiveCatalogMetrics
+import 

spark-website git commit: trigger resync

2017-05-03 Thread marmbrus
Repository: spark-website
Updated Branches:
  refs/heads/asf-site d4f0c34ac -> 7b32b181f


trigger resync


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/7b32b181
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/7b32b181
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/7b32b181

Branch: refs/heads/asf-site
Commit: 7b32b181fd554755eab39658c79e56b3ad1b4334
Parents: d4f0c34
Author: Michael Armbrust 
Authored: Wed May 3 10:27:05 2017 -0700
Committer: Michael Armbrust 
Committed: Wed May 3 10:27:05 2017 -0700

--
 news/_posts/2017-05-02-spark-2-1-1-released.md | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/7b32b181/news/_posts/2017-05-02-spark-2-1-1-released.md
--
diff --git a/news/_posts/2017-05-02-spark-2-1-1-released.md 
b/news/_posts/2017-05-02-spark-2-1-1-released.md
index fe72279..3dd12f2 100644
--- a/news/_posts/2017-05-02-spark-2-1-1-released.md
+++ b/news/_posts/2017-05-02-spark-2-1-1-released.md
@@ -12,3 +12,4 @@ meta:
   _wpas_done_all: '1'
 ---
 We are happy to announce the availability of Apache Spark 2.1.1! Visit the release notes to read about the changes, or download the release today.
+





[spark-website] Git Push Summary

2017-05-03 Thread marmbrus
Repository: spark-website
Updated Branches:
  refs/heads/spark-2.1.1 [created] d4f0c34ac




spark git commit: [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame

2017-05-03 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 b1a732fea -> f0e80aa2d


[SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame

## What changes were proposed in this pull request?
We allow users to specify hints (currently only "broadcast" is supported) in 
SQL and DataFrame. However, while SQL has a standard hint format (/*+ ... */), 
DataFrame does not, and users are sometimes confused because they cannot find 
out how to apply a broadcast hint. This ticket adds a generic hint function on 
DataFrame that allows the same hints to be used on DataFrames as in SQL.

As an example, after this patch, the following will apply a broadcast hint on a 
DataFrame using the new hint function:

```
df1.join(df2.hint("broadcast"))
```

## How was this patch tested?
Added a test case in DataFrameJoinSuite.

Author: Reynold Xin 

Closes #17839 from rxin/SPARK-20576.

(cherry picked from commit 527fc5d0c990daaacad4740f62cfe6736609b77b)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f0e80aa2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f0e80aa2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f0e80aa2

Branch: refs/heads/branch-2.2
Commit: f0e80aa2ddee80819ef33ee24eb6a15a73bc02d5
Parents: b1a732f
Author: Reynold Xin 
Authored: Wed May 3 09:22:25 2017 -0700
Committer: Reynold Xin 
Committed: Wed May 3 09:22:41 2017 -0700

--
 .../sql/catalyst/analysis/ResolveHints.scala  |  8 +++-
 .../main/scala/org/apache/spark/sql/Dataset.scala | 16 
 .../org/apache/spark/sql/DataFrameJoinSuite.scala | 18 +-
 3 files changed, 40 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f0e80aa2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
index c4827b8..df688fa 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
@@ -86,7 +86,13 @@ object ResolveHints {
 
 def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
   case h: Hint if 
BROADCAST_HINT_NAMES.contains(h.name.toUpperCase(Locale.ROOT)) =>
-applyBroadcastHint(h.child, h.parameters.toSet)
+if (h.parameters.isEmpty) {
+  // If there is no table alias specified, turn the entire subtree 
into a BroadcastHint.
+  BroadcastHint(h.child)
+} else {
+  // Otherwise, find within the subtree query plans that should be 
broadcasted.
+  applyBroadcastHint(h.child, h.parameters.toSet)
+}
 }
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/f0e80aa2/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 06dd550..5f602dc 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -1074,6 +1074,22 @@ class Dataset[T] private[sql](
   def apply(colName: String): Column = col(colName)
 
   /**
+   * Specifies some hint on the current Dataset. As an example, the following 
code specifies
+   * that one of the plan can be broadcasted:
+   *
+   * {{{
+   *   df1.join(df2.hint("broadcast"))
+   * }}}
+   *
+   * @group basic
+   * @since 2.2.0
+   */
+  @scala.annotation.varargs
+  def hint(name: String, parameters: String*): Dataset[T] = withTypedPlan {
+Hint(name, parameters, logicalPlan)
+  }
+
+  /**
* Selects column based on the column name and return it as a [[Column]].
*
* @note The column name can also reference to a nested column like `a.b`.

http://git-wip-us.apache.org/repos/asf/spark/blob/f0e80aa2/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
index 541ffb5..4a52af6 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
@@ -151,7 +151,7 @@ class DataFrameJoinSuite extends QueryTest with 

spark git commit: [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame

2017-05-03 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 27f543b15 -> 527fc5d0c


[SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame

## What changes were proposed in this pull request?
We allow users to specify hints (currently only "broadcast" is supported) in 
SQL and DataFrame. However, while SQL has a standard hint format (/*+ ... */), 
DataFrame does not, and users are sometimes confused because they cannot find 
out how to apply a broadcast hint. This ticket adds a generic hint function on 
DataFrame that allows the same hints to be used on DataFrames as in SQL.

As an example, after this patch, the following will apply a broadcast hint on a 
DataFrame using the new hint function:

```
df1.join(df2.hint("broadcast"))
```

## How was this patch tested?
Added a test case in DataFrameJoinSuite.

Author: Reynold Xin 

Closes #17839 from rxin/SPARK-20576.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/527fc5d0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/527fc5d0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/527fc5d0

Branch: refs/heads/master
Commit: 527fc5d0c990daaacad4740f62cfe6736609b77b
Parents: 27f543b
Author: Reynold Xin 
Authored: Wed May 3 09:22:25 2017 -0700
Committer: Reynold Xin 
Committed: Wed May 3 09:22:25 2017 -0700

--
 .../sql/catalyst/analysis/ResolveHints.scala  |  8 +++-
 .../main/scala/org/apache/spark/sql/Dataset.scala | 16 
 .../org/apache/spark/sql/DataFrameJoinSuite.scala | 18 +-
 3 files changed, 40 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/527fc5d0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
index c4827b8..df688fa 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
@@ -86,7 +86,13 @@ object ResolveHints {
 
 def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
   case h: Hint if 
BROADCAST_HINT_NAMES.contains(h.name.toUpperCase(Locale.ROOT)) =>
-applyBroadcastHint(h.child, h.parameters.toSet)
+if (h.parameters.isEmpty) {
+  // If there is no table alias specified, turn the entire subtree 
into a BroadcastHint.
+  BroadcastHint(h.child)
+} else {
+  // Otherwise, find within the subtree query plans that should be 
broadcasted.
+  applyBroadcastHint(h.child, h.parameters.toSet)
+}
 }
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/527fc5d0/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 147e765..620c8bd 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -1161,6 +1161,22 @@ class Dataset[T] private[sql](
   def apply(colName: String): Column = col(colName)
 
   /**
+   * Specifies some hint on the current Dataset. As an example, the following 
code specifies
+   * that one of the plan can be broadcasted:
+   *
+   * {{{
+   *   df1.join(df2.hint("broadcast"))
+   * }}}
+   *
+   * @group basic
+   * @since 2.2.0
+   */
+  @scala.annotation.varargs
+  def hint(name: String, parameters: String*): Dataset[T] = withTypedPlan {
+Hint(name, parameters, logicalPlan)
+  }
+
+  /**
* Selects column based on the column name and return it as a [[Column]].
*
* @note The column name can also reference to a nested column like `a.b`.

http://git-wip-us.apache.org/repos/asf/spark/blob/527fc5d0/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
index 541ffb5..4a52af6 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
@@ -151,7 +151,7 @@ class DataFrameJoinSuite extends QueryTest with 
SharedSQLContext {
   Row(1, 1, 1, 1) :: Row(2, 1, 2, 2) :: Nil)
   }
 
-  test("broadcast join hint") {
+  test("broadcast 

spark git commit: [SPARK-20441][SPARK-20432][SS] Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation

2017-05-03 Thread brkyvz
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 b5947f5c3 -> b1a732fea


[SPARK-20441][SPARK-20432][SS] Within the same streaming query, one 
StreamingRelation should only be transformed to one StreamingExecutionRelation

## What changes were proposed in this pull request?

Within the same streaming query, when one `StreamingRelation` is referred to 
multiple times – e.g. `df.union(df)` – we should transform it into only one 
`StreamingExecutionRelation`, instead of two or more different 
`StreamingExecutionRelation`s (each of which would have its own source, 
source logs, ...).
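
The fix memoizes the conversion, so the same `StreamingRelation` instance always maps to the same `StreamingExecutionRelation`. A minimal Python sketch of the same memoization idea (the class and function names are placeholders, not Spark's actual classes; the real code uses a Scala mutable map with `getOrElseUpdate`):

```python
class StreamingRelation:
    """Placeholder for the unresolved streaming relation in the logical plan."""
    def __init__(self, name):
        self.name = name

class StreamingExecutionRelation:
    """Placeholder for the materialized relation backed by a single source."""
    def __init__(self, relation, source_id):
        self.relation = relation
        self.source_id = source_id

def make_converter():
    cache = {}            # plays the role of toExecutionRelationMap
    next_source_id = [0]  # plays the role of nextSourceId

    def convert(rel):
        # Materialize the source only the first time this relation is seen.
        if rel not in cache:
            cache[rel] = StreamingExecutionRelation(rel, next_source_id[0])
            next_source_id[0] += 1
        return cache[rel]

    return convert

convert = make_converter()
rel = StreamingRelation("rate-source")
# df.union(df): the same relation appears twice but yields one execution relation.
assert convert(rel) is convert(rel)
```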

## How was this patch tested?

Added two test cases, each of which would fail without this patch.

Author: Liwei Lin 

Closes #17735 from lw-lin/SPARK-20441.

(cherry picked from commit 27f543b15f2f493f6f8373e46b4c9564b0a1bf81)
Signed-off-by: Burak Yavuz 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b1a732fe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b1a732fe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b1a732fe

Branch: refs/heads/branch-2.2
Commit: b1a732fead32a37afcb7cf7a35facc49a449b8e2
Parents: b5947f5
Author: Liwei Lin 
Authored: Wed May 3 08:55:02 2017 -0700
Committer: Burak Yavuz 
Committed: Wed May 3 08:55:17 2017 -0700

--
 .../execution/streaming/StreamExecution.scala   | 20 
 .../spark/sql/streaming/StreamSuite.scala   | 48 
 2 files changed, 60 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b1a732fe/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
index affc201..b6ddf74 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
@@ -23,6 +23,7 @@ import java.util.concurrent.{CountDownLatch, TimeUnit}
 import java.util.concurrent.atomic.AtomicReference
 import java.util.concurrent.locks.ReentrantLock
 
+import scala.collection.mutable.{Map => MutableMap}
 import scala.collection.mutable.ArrayBuffer
 import scala.util.control.NonFatal
 
@@ -148,15 +149,18 @@ class StreamExecution(
       "logicalPlan must be initialized in StreamExecutionThread " +
         s"but the current thread was ${Thread.currentThread}")
     var nextSourceId = 0L
+    val toExecutionRelationMap = MutableMap[StreamingRelation, StreamingExecutionRelation]()
     val _logicalPlan = analyzedPlan.transform {
-      case StreamingRelation(dataSource, _, output) =>
-        // Materialize source to avoid creating it in every batch
-        val metadataPath = s"$checkpointRoot/sources/$nextSourceId"
-        val source = dataSource.createSource(metadataPath)
-        nextSourceId += 1
-        // We still need to use the previous `output` instead of `source.schema` as attributes in
-        // "df.logicalPlan" has already used attributes of the previous `output`.
-        StreamingExecutionRelation(source, output)
+      case streamingRelation@StreamingRelation(dataSource, _, output) =>
+        toExecutionRelationMap.getOrElseUpdate(streamingRelation, {
+          // Materialize source to avoid creating it in every batch
+          val metadataPath = s"$checkpointRoot/sources/$nextSourceId"
+          val source = dataSource.createSource(metadataPath)
+          nextSourceId += 1
+          // We still need to use the previous `output` instead of `source.schema` as attributes in
+          // "df.logicalPlan" has already used attributes of the previous `output`.
+          StreamingExecutionRelation(source, output)
+        })
     }
     sources = _logicalPlan.collect { case s: StreamingExecutionRelation => s.source }
     uniqueSources = sources.distinct

http://git-wip-us.apache.org/repos/asf/spark/blob/b1a732fe/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
index 01ea62a..1fc0629 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
@@ -71,6 +71,27 @@ class StreamSuite extends StreamTest {
   CheckAnswer(Row(1, 1, "one"), Row(2, 2, "two"), Row(4, 4, 

spark git commit: [SPARK-20441][SPARK-20432][SS] Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation

2017-05-03 Thread brkyvz
Repository: spark
Updated Branches:
  refs/heads/master 7f96f2d7f -> 27f543b15


[SPARK-20441][SPARK-20432][SS] Within the same streaming query, one 
StreamingRelation should only be transformed to one StreamingExecutionRelation

## What changes were proposed in this pull request?

Within the same streaming query, when one `StreamingRelation` is referenced 
multiple times (e.g. `df.union(df)`), we should transform it into a single 
`StreamingExecutionRelation` rather than two or more distinct 
`StreamingExecutionRelation`s, each of which would otherwise get its own 
source, source logs, and so on.
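
For reference, a minimal sketch of the user-facing query shape this addresses; the `rate` source, console sink, and timeout are illustrative assumptions, not taken from the PR (the added tests use different sources). With the fix, both branches of the union are backed by a single source:

```scala
import org.apache.spark.sql.SparkSession

object SelfUnionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("self-union-sketch")
      .getOrCreate()

    val df = spark.readStream.format("rate").load() // one logical StreamingRelation
    val unioned = df.union(df)                      // referenced twice in the same query

    // After this change, only one source (and one set of source logs) is created.
    val query = unioned.writeStream.format("console").start()
    query.awaitTermination(3000) // let it run briefly for illustration
    query.stop()
    spark.stop()
  }
}
```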

## How was this patch tested?

Added two test cases, each of which would fail without this patch.

Author: Liwei Lin 

Closes #17735 from lw-lin/SPARK-20441.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/27f543b1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/27f543b1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/27f543b1

Branch: refs/heads/master
Commit: 27f543b15f2f493f6f8373e46b4c9564b0a1bf81
Parents: 7f96f2d
Author: Liwei Lin 
Authored: Wed May 3 08:55:02 2017 -0700
Committer: Burak Yavuz 
Committed: Wed May 3 08:55:02 2017 -0700

--
 .../execution/streaming/StreamExecution.scala   | 20 
 .../spark/sql/streaming/StreamSuite.scala   | 48 
 2 files changed, 60 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/27f543b1/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
index affc201..b6ddf74 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
@@ -23,6 +23,7 @@ import java.util.concurrent.{CountDownLatch, TimeUnit}
 import java.util.concurrent.atomic.AtomicReference
 import java.util.concurrent.locks.ReentrantLock
 
+import scala.collection.mutable.{Map => MutableMap}
 import scala.collection.mutable.ArrayBuffer
 import scala.util.control.NonFatal
 
@@ -148,15 +149,18 @@ class StreamExecution(
       "logicalPlan must be initialized in StreamExecutionThread " +
         s"but the current thread was ${Thread.currentThread}")
     var nextSourceId = 0L
+    val toExecutionRelationMap = MutableMap[StreamingRelation, StreamingExecutionRelation]()
     val _logicalPlan = analyzedPlan.transform {
-      case StreamingRelation(dataSource, _, output) =>
-        // Materialize source to avoid creating it in every batch
-        val metadataPath = s"$checkpointRoot/sources/$nextSourceId"
-        val source = dataSource.createSource(metadataPath)
-        nextSourceId += 1
-        // We still need to use the previous `output` instead of `source.schema` as attributes in
-        // "df.logicalPlan" has already used attributes of the previous `output`.
-        StreamingExecutionRelation(source, output)
+      case streamingRelation@StreamingRelation(dataSource, _, output) =>
+        toExecutionRelationMap.getOrElseUpdate(streamingRelation, {
+          // Materialize source to avoid creating it in every batch
+          val metadataPath = s"$checkpointRoot/sources/$nextSourceId"
+          val source = dataSource.createSource(metadataPath)
+          nextSourceId += 1
+          // We still need to use the previous `output` instead of `source.schema` as attributes in
+          // "df.logicalPlan" has already used attributes of the previous `output`.
+          StreamingExecutionRelation(source, output)
+        })
     }
     sources = _logicalPlan.collect { case s: StreamingExecutionRelation => s.source }
    uniqueSources = sources.distinct

http://git-wip-us.apache.org/repos/asf/spark/blob/27f543b1/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
index 01ea62a..1fc0629 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
@@ -71,6 +71,27 @@ class StreamSuite extends StreamTest {
   CheckAnswer(Row(1, 1, "one"), Row(2, 2, "two"), Row(4, 4, "four")))
   }
 
+  test("SPARK-20432: union one stream with itself") {
+val df = 

spark git commit: [SPARK-16957][MLLIB] Use midpoints for split values.

2017-05-03 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 16fab6b0e -> 7f96f2d7f


[SPARK-16957][MLLIB] Use midpoints for split values.

## What changes were proposed in this pull request?

Use the midpoints between adjacent distinct feature values as split thresholds for now; a weighted variant may follow later.
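
A self-contained sketch of the midpoint rule (the helper name and sample data are illustrative; it mirrors only the "few distinct values" branch and ignores the stride-based sampling used when there are many distinct values):

```scala
object MidpointSplitsSketch {
  // For each pair of adjacent distinct values, emit the midpoint as a split threshold.
  def midpointSplits(featureSamples: Array[Double]): Array[Double] = {
    val distinctSorted = featureSamples.distinct.sorted
    if (distinctSorted.length < 2) {
      Array.empty[Double] // a constant feature yields no splits
    } else {
      distinctSorted.sliding(2).map { case Array(lo, hi) => (lo + hi) / 2.0 }.toArray
    }
  }

  def main(args: Array[String]): Unit = {
    val samples = Array(0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0)
    // Distinct sorted values are 0.0 and 1.0, so the only split threshold is 0.5.
    println(midpointSplits(samples).mkString(", "))
  }
}
```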

## How was this patch tested?

+ [x] add unit test.
+ [x] revise Split's unit test.

Author: Yan Facai (颜发才) 
Author: 颜发才(Yan Facai) 

Closes #17556 from 
facaiy/ENH/decision_tree_overflow_and_precision_in_aggregation.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7f96f2d7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7f96f2d7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7f96f2d7

Branch: refs/heads/master
Commit: 7f96f2d7f2d5abf81dd7f8ca27fea35cf798fd65
Parents: 16fab6b
Author: Yan Facai (颜发才) 
Authored: Wed May 3 10:54:40 2017 +0100
Committer: Sean Owen 
Committed: Wed May 3 10:54:40 2017 +0100

--
 .../spark/ml/tree/impl/RandomForest.scala   | 15 ---
 .../spark/ml/tree/impl/RandomForestSuite.scala  | 41 +---
 python/pyspark/mllib/tree.py| 12 +++---
 3 files changed, 51 insertions(+), 17 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7f96f2d7/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala 
b/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
index 008dd19..82e1ed8 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
@@ -996,7 +996,7 @@ private[spark] object RandomForest extends Logging {
     require(metadata.isContinuous(featureIndex),
       "findSplitsForContinuousFeature can only be used to find splits for a continuous feature.")
 
-    val splits = if (featureSamples.isEmpty) {
+    val splits: Array[Double] = if (featureSamples.isEmpty) {
       Array.empty[Double]
     } else {
       val numSplits = metadata.numSplits(featureIndex)
@@ -1009,10 +1009,15 @@ private[spark] object RandomForest extends Logging {
       // sort distinct values
       val valueCounts = valueCountMap.toSeq.sortBy(_._1).toArray
 
-      // if possible splits is not enough or just enough, just return all possible splits
       val possibleSplits = valueCounts.length - 1
-      if (possibleSplits <= numSplits) {
-        valueCounts.map(_._1).init
+      if (possibleSplits == 0) {
+        // constant feature
+        Array.empty[Double]
+      } else if (possibleSplits <= numSplits) {
+        // if possible splits is not enough or just enough, just return all possible splits
+        (1 to possibleSplits)
+          .map(index => (valueCounts(index - 1)._1 + valueCounts(index)._1) / 2.0)
+          .toArray
       } else {
         // stride between splits
         val stride: Double = numSamples.toDouble / (numSplits + 1)
@@ -1037,7 +1042,7 @@ private[spark] object RandomForest extends Logging {
           // makes the gap between currentCount and targetCount smaller,
           // previous value is a split threshold.
           if (previousGap < currentGap) {
-            splitsBuilder += valueCounts(index - 1)._1
+            splitsBuilder += (valueCounts(index - 1)._1 + valueCounts(index)._1) / 2.0
             targetCount += stride
           }
           index += 1

http://git-wip-us.apache.org/repos/asf/spark/blob/7f96f2d7/mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala
index e1ab7c2..df155b4 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala
@@ -104,6 +104,31 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
       assert(splits.distinct.length === splits.length)
     }
 
+    // SPARK-16957: Use midpoints for split values.
+    {
+      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
+        Map(), Set(),
+        Array(3), Gini, QuantileStrategy.Sort,
+        0, 0, 0.0, 0, 0
+      )
+
+      // possibleSplits <= numSplits
+      {
+        val featureSamples = Array(0, 1, 0, 0, 1, 0, 1, 1).map(_.toDouble)
+        val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
+        val expectedSplits = Array((0.0 

spark git commit: [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release

2017-05-03 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 4f647ab66 -> b5947f5c3


[SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release

## What changes were proposed in this pull request?

Fix build warnings, primarily those related to Breeze 0.13 operator changes and 
Java style problems.

## How was this patch tested?

Existing tests

Author: Sean Owen 

Closes #17803 from srowen/SPARK-20523.

(cherry picked from commit 16fab6b0ef3dcb33f92df30e17680922ad5fb672)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b5947f5c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b5947f5c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b5947f5c

Branch: refs/heads/branch-2.2
Commit: b5947f5c33eb403d65b1c316d1781c3d7cebf01b
Parents: 4f647ab
Author: Sean Owen 
Authored: Wed May 3 10:18:35 2017 +0100
Committer: Sean Owen 
Committed: Wed May 3 10:18:48 2017 +0100

--
 .../apache/spark/network/yarn/YarnShuffleService.java |  4 ++--
 .../main/java/org/apache/spark/unsafe/Platform.java   |  3 ++-
 .../org/apache/spark/memory/TaskMemoryManager.java|  3 ++-
 .../apache/spark/scheduler/TaskSetManagerSuite.scala  | 11 ++-
 .../spark/storage/BlockReplicationPolicySuite.scala   |  1 +
 dev/checkstyle-suppressions.xml   |  4 
 .../sql/streaming/JavaStructuredSessionization.java   |  2 --
 .../scala/org/apache/spark/graphx/lib/PageRank.scala  | 14 +++---
 .../scala/org/apache/spark/ml/ann/LossFunction.scala  |  4 ++--
 .../apache/spark/ml/clustering/GaussianMixture.scala  |  2 +-
 .../spark/mllib/clustering/GaussianMixture.scala  |  2 +-
 .../org/apache/spark/mllib/clustering/LDAModel.scala  |  8 
 .../apache/spark/mllib/clustering/LDAOptimizer.scala  | 12 ++--
 .../org/apache/spark/mllib/clustering/LDAUtils.scala  |  2 +-
 .../spark/ml/classification/NaiveBayesSuite.scala |  2 +-
 pom.xml   |  4 
 .../scheduler/cluster/YarnSchedulerBackendSuite.scala |  2 ++
 .../apache/spark/sql/streaming/GroupStateTimeout.java |  5 -
 .../catalyst/expressions/JsonExpressionsSuite.scala   |  2 +-
 .../parquet/SpecificParquetRecordReaderBase.java  |  5 +++--
 .../spark/sql/execution/QueryExecutionSuite.scala |  2 ++
 .../sql/streaming/StreamingQueryListenerSuite.scala   |  1 +
 .../spark/sql/hive/execution/HiveDDLSuite.scala   |  2 +-
 23 files changed, 54 insertions(+), 43 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b5947f5c/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
--
diff --git 
a/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
 
b/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
index 4acc203..fd50e3a 100644
--- 
a/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
+++ 
b/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
@@ -363,9 +363,9 @@ public class YarnShuffleService extends AuxiliaryService {
           // If another DB was initialized first just make sure all the DBs are in the same
           // location.
           Path newLoc = new Path(_recoveryPath, dbName);
-          Path copyFrom = new Path(f.toURI()); 
+          Path copyFrom = new Path(f.toURI());
           if (!newLoc.equals(copyFrom)) {
-            logger.info("Moving " + copyFrom + " to: " + newLoc); 
+            logger.info("Moving " + copyFrom + " to: " + newLoc);
            try {
              // The move here needs to handle moving non-empty directories across NFS mounts
              FileSystem fs = FileSystem.getLocal(_conf);

http://git-wip-us.apache.org/repos/asf/spark/blob/b5947f5c/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
--
diff --git a/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
index 1321b83..4ab5b68 100644
--- a/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
+++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
@@ -48,7 +48,8 @@ public final class Platform {
 boolean _unaligned;
 String arch = System.getProperty("os.arch", "");
 if (arch.equals("ppc64le") || arch.equals("ppc64")) {
-      // Since java.nio.Bits.unaligned() doesn't return true on ppc (See JDK-8165231), but ppc64 and ppc64le support it
+  // Since java.nio.Bits.unaligned() doesn't return true on ppc (See 

spark git commit: [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release

2017-05-03 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master db2fb84b4 -> 16fab6b0e


[SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release

## What changes were proposed in this pull request?

Fix build warnings, primarily those related to Breeze 0.13 operator changes and 
Java style problems.

## How was this patch tested?

Existing tests

Author: Sean Owen 

Closes #17803 from srowen/SPARK-20523.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/16fab6b0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/16fab6b0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/16fab6b0

Branch: refs/heads/master
Commit: 16fab6b0ef3dcb33f92df30e17680922ad5fb672
Parents: db2fb84
Author: Sean Owen 
Authored: Wed May 3 10:18:35 2017 +0100
Committer: Sean Owen 
Committed: Wed May 3 10:18:35 2017 +0100

--
 .../apache/spark/network/yarn/YarnShuffleService.java |  4 ++--
 .../main/java/org/apache/spark/unsafe/Platform.java   |  3 ++-
 .../org/apache/spark/memory/TaskMemoryManager.java|  3 ++-
 .../apache/spark/scheduler/TaskSetManagerSuite.scala  | 11 ++-
 .../spark/storage/BlockReplicationPolicySuite.scala   |  1 +
 dev/checkstyle-suppressions.xml   |  4 
 .../sql/streaming/JavaStructuredSessionization.java   |  2 --
 .../scala/org/apache/spark/graphx/lib/PageRank.scala  | 14 +++---
 .../scala/org/apache/spark/ml/ann/LossFunction.scala  |  4 ++--
 .../apache/spark/ml/clustering/GaussianMixture.scala  |  2 +-
 .../spark/mllib/clustering/GaussianMixture.scala  |  2 +-
 .../org/apache/spark/mllib/clustering/LDAModel.scala  |  8 
 .../apache/spark/mllib/clustering/LDAOptimizer.scala  | 12 ++--
 .../org/apache/spark/mllib/clustering/LDAUtils.scala  |  2 +-
 .../spark/ml/classification/NaiveBayesSuite.scala |  2 +-
 pom.xml   |  4 
 .../scheduler/cluster/YarnSchedulerBackendSuite.scala |  2 ++
 .../apache/spark/sql/streaming/GroupStateTimeout.java |  5 -
 .../catalyst/expressions/JsonExpressionsSuite.scala   |  2 +-
 .../parquet/SpecificParquetRecordReaderBase.java  |  5 +++--
 .../spark/sql/execution/QueryExecutionSuite.scala |  2 ++
 .../sql/streaming/StreamingQueryListenerSuite.scala   |  1 +
 .../spark/sql/hive/execution/HiveDDLSuite.scala   |  2 +-
 23 files changed, 54 insertions(+), 43 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/16fab6b0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
--
diff --git 
a/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
 
b/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
index 4acc203..fd50e3a 100644
--- 
a/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
+++ 
b/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
@@ -363,9 +363,9 @@ public class YarnShuffleService extends AuxiliaryService {
           // If another DB was initialized first just make sure all the DBs are in the same
           // location.
           Path newLoc = new Path(_recoveryPath, dbName);
-          Path copyFrom = new Path(f.toURI()); 
+          Path copyFrom = new Path(f.toURI());
           if (!newLoc.equals(copyFrom)) {
-            logger.info("Moving " + copyFrom + " to: " + newLoc); 
+            logger.info("Moving " + copyFrom + " to: " + newLoc);
            try {
              // The move here needs to handle moving non-empty directories across NFS mounts
              FileSystem fs = FileSystem.getLocal(_conf);

http://git-wip-us.apache.org/repos/asf/spark/blob/16fab6b0/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
--
diff --git a/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
index 1321b83..4ab5b68 100644
--- a/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
+++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
@@ -48,7 +48,8 @@ public final class Platform {
     boolean _unaligned;
     String arch = System.getProperty("os.arch", "");
     if (arch.equals("ppc64le") || arch.equals("ppc64")) {
-      // Since java.nio.Bits.unaligned() doesn't return true on ppc (See JDK-8165231), but ppc64 and ppc64le support it
+      // Since java.nio.Bits.unaligned() doesn't return true on ppc (See JDK-8165231), but
+      // ppc64 and ppc64le support it
       _unaligned = true;
     } else {
       try {


spark git commit: [SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and PCA (v2)

2017-05-03 Thread mlnick
Repository: spark
Updated Branches:
  refs/heads/master 6235132a8 -> db2fb84b4


[SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and PCA (v2)

Add PCA and SVD to PySpark's wrapper for `RowMatrix`, and SVD (only) to its 
wrapper for `IndexedRowMatrix`.

Based on #7963, updated.
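
The new Python wrappers delegate to the existing JVM MLlib implementations; for reference, a minimal Scala sketch of the corresponding `RowMatrix` calls (the local master and toy data are illustrative, not from the PR):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object SvdPcaSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("svd-pca"))

    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)))
    val mat = new RowMatrix(rows)

    // Top-2 singular values/vectors; computeU = true also materializes U.
    val svd = mat.computeSVD(2, computeU = true)
    println(svd.s)

    // Top-2 principal components, then project the rows onto them.
    val pc: Matrix = mat.computePrincipalComponents(2)
    val projected = mat.multiply(pc)
    projected.rows.collect().foreach(println)

    sc.stop()
  }
}
```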

## How was this patch tested?

New doc tests and unit tests. Ran all examples locally.

Author: MechCoder 
Author: Nick Pentreath 

Closes #17621 from MLnick/SPARK-6227-pyspark-svd-pca.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/db2fb84b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/db2fb84b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/db2fb84b

Branch: refs/heads/master
Commit: db2fb84b4a3c45daa449cc9232340193ce8eb37d
Parents: 6235132
Author: MechCoder 
Authored: Wed May 3 10:58:05 2017 +0200
Committer: Nick Pentreath 
Committed: Wed May 3 10:58:05 2017 +0200

--
 docs/mllib-dimensionality-reduction.md  |  29 +--
 .../spark/examples/mllib/JavaPCAExample.java|  27 ++-
 .../spark/examples/mllib/JavaSVDExample.java|  27 +--
 .../main/python/mllib/pca_rowmatrix_example.py  |  46 +
 examples/src/main/python/mllib/svd_example.py   |  48 +
 .../examples/mllib/PCAOnRowMatrixExample.scala  |   4 +-
 .../spark/examples/mllib/SVDExample.scala   |  11 +-
 python/pyspark/mllib/linalg/distributed.py  | 199 ++-
 python/pyspark/mllib/tests.py   |  63 ++
 9 files changed, 408 insertions(+), 46 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/db2fb84b/docs/mllib-dimensionality-reduction.md
--
diff --git a/docs/mllib-dimensionality-reduction.md 
b/docs/mllib-dimensionality-reduction.md
index 539cbc1..a72680d 100644
--- a/docs/mllib-dimensionality-reduction.md
+++ b/docs/mllib-dimensionality-reduction.md
@@ -76,13 +76,14 @@ Refer to the [`SingularValueDecomposition` Java 
docs](api/java/org/apache/spark/
 
 The same code applies to `IndexedRowMatrix` if `U` is defined as an
 `IndexedRowMatrix`.
+
+
+Refer to the [`SingularValueDecomposition` Python 
docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.SingularValueDecomposition)
 for details on the API.
 
-In order to run the above application, follow the instructions
-provided in the [Self-Contained
-Applications](quick-start.html#self-contained-applications) section of the 
Spark
-quick-start guide. Be sure to also include *spark-mllib* to your build file as
-a dependency.
+{% include_example python/mllib/svd_example.py %}
 
+The same code applies to `IndexedRowMatrix` if `U` is defined as an
+`IndexedRowMatrix`.
 
 
 
@@ -118,17 +119,21 @@ Refer to the [`PCA` Scala 
docs](api/scala/index.html#org.apache.spark.mllib.feat
 
 The following code demonstrates how to compute principal components on a 
`RowMatrix`
 and use them to project the vectors into a low-dimensional space.
-The number of columns should be small, e.g, less than 1000.
 
 Refer to the [`RowMatrix` Java 
docs](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for 
details on the API.
 
 {% include_example java/org/apache/spark/examples/mllib/JavaPCAExample.java %}
 
 
-
 
-In order to run the above application, follow the instructions
-provided in the [Self-Contained 
Applications](quick-start.html#self-contained-applications)
-section of the Spark
-quick-start guide. Be sure to also include *spark-mllib* to your build file as
-a dependency.
+
+
+The following code demonstrates how to compute principal components on a 
`RowMatrix`
+and use them to project the vectors into a low-dimensional space.
+
+Refer to the [`RowMatrix` Python 
docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix) 
for details on the API.
+
+{% include_example python/mllib/pca_rowmatrix_example.py %}
+
+
+

http://git-wip-us.apache.org/repos/asf/spark/blob/db2fb84b/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java
--
diff --git 
a/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java 
b/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java
index 3077f55..0a7dc62 100644
--- a/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java
+++ b/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java
@@ -18,7 +18,8 @@
 package org.apache.spark.examples.mllib;
 
 // $example on$
-import java.util.LinkedList;
+import java.util.Arrays;
+import java.util.List;
 // $example off$
 
 import org.apache.spark.SparkConf;
@@ -39,21 +40,25 @@ public class JavaPCAExample {
   

spark git commit: [SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and PCA (v2)

2017-05-03 Thread mlnick
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 c80242ab9 -> 4f647ab66


[SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and PCA (v2)

Add PCA and SVD to PySpark's wrapper for `RowMatrix`, and SVD (only) to its 
wrapper for `IndexedRowMatrix`.

Based on #7963, updated.
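
For the `IndexedRowMatrix` side, which gains SVD only, a comparable Scala sketch of the underlying API (row indices and values are illustrative assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

object IndexedRowMatrixSvdSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("irm-svd"))

    val rows = sc.parallelize(Seq(
      IndexedRow(0L, Vectors.dense(1.0, 2.0, 3.0)),
      IndexedRow(1L, Vectors.dense(2.0, 3.0, 4.0)),
      IndexedRow(3L, Vectors.dense(5.0, 8.0, 13.0))))
    val mat = new IndexedRowMatrix(rows)

    // Top-2 singular values; U comes back as another IndexedRowMatrix when computeU = true.
    val svd = mat.computeSVD(2, computeU = true)
    println(svd.s)
    svd.U.rows.collect().foreach(println)

    sc.stop()
  }
}
```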

## How was this patch tested?

New doc tests and unit tests. Ran all examples locally.

Author: MechCoder 
Author: Nick Pentreath 

Closes #17621 from MLnick/SPARK-6227-pyspark-svd-pca.

(cherry picked from commit db2fb84b4a3c45daa449cc9232340193ce8eb37d)
Signed-off-by: Nick Pentreath 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4f647ab6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4f647ab6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4f647ab6

Branch: refs/heads/branch-2.2
Commit: 4f647ab66353b136e4fdf02587ebbd88ce5c5b5f
Parents: c80242a
Author: MechCoder 
Authored: Wed May 3 10:58:05 2017 +0200
Committer: Nick Pentreath 
Committed: Wed May 3 10:58:20 2017 +0200

--
 docs/mllib-dimensionality-reduction.md  |  29 +--
 .../spark/examples/mllib/JavaPCAExample.java|  27 ++-
 .../spark/examples/mllib/JavaSVDExample.java|  27 +--
 .../main/python/mllib/pca_rowmatrix_example.py  |  46 +
 examples/src/main/python/mllib/svd_example.py   |  48 +
 .../examples/mllib/PCAOnRowMatrixExample.scala  |   4 +-
 .../spark/examples/mllib/SVDExample.scala   |  11 +-
 python/pyspark/mllib/linalg/distributed.py  | 199 ++-
 python/pyspark/mllib/tests.py   |  63 ++
 9 files changed, 408 insertions(+), 46 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4f647ab6/docs/mllib-dimensionality-reduction.md
--
diff --git a/docs/mllib-dimensionality-reduction.md 
b/docs/mllib-dimensionality-reduction.md
index 539cbc1..a72680d 100644
--- a/docs/mllib-dimensionality-reduction.md
+++ b/docs/mllib-dimensionality-reduction.md
@@ -76,13 +76,14 @@ Refer to the [`SingularValueDecomposition` Java 
docs](api/java/org/apache/spark/
 
 The same code applies to `IndexedRowMatrix` if `U` is defined as an
 `IndexedRowMatrix`.
+
+
+Refer to the [`SingularValueDecomposition` Python 
docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.SingularValueDecomposition)
 for details on the API.
 
-In order to run the above application, follow the instructions
-provided in the [Self-Contained
-Applications](quick-start.html#self-contained-applications) section of the 
Spark
-quick-start guide. Be sure to also include *spark-mllib* to your build file as
-a dependency.
+{% include_example python/mllib/svd_example.py %}
 
+The same code applies to `IndexedRowMatrix` if `U` is defined as an
+`IndexedRowMatrix`.
 
 
 
@@ -118,17 +119,21 @@ Refer to the [`PCA` Scala 
docs](api/scala/index.html#org.apache.spark.mllib.feat
 
 The following code demonstrates how to compute principal components on a 
`RowMatrix`
 and use them to project the vectors into a low-dimensional space.
-The number of columns should be small, e.g, less than 1000.
 
 Refer to the [`RowMatrix` Java 
docs](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for 
details on the API.
 
 {% include_example java/org/apache/spark/examples/mllib/JavaPCAExample.java %}
 
 
-
 
-In order to run the above application, follow the instructions
-provided in the [Self-Contained 
Applications](quick-start.html#self-contained-applications)
-section of the Spark
-quick-start guide. Be sure to also include *spark-mllib* to your build file as
-a dependency.
+
+
+The following code demonstrates how to compute principal components on a 
`RowMatrix`
+and use them to project the vectors into a low-dimensional space.
+
+Refer to the [`RowMatrix` Python 
docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix) 
for details on the API.
+
+{% include_example python/mllib/pca_rowmatrix_example.py %}
+
+
+

http://git-wip-us.apache.org/repos/asf/spark/blob/4f647ab6/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java
--
diff --git 
a/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java 
b/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java
index 3077f55..0a7dc62 100644
--- a/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java
+++ b/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java
@@ -18,7 +18,8 @@
 package org.apache.spark.examples.mllib;
 
 // $example on$
-import java.util.LinkedList;
+import java.util.Arrays;
+import