spark git commit: [SPARK-20544][SPARKR] skip tests when running on CRAN
Repository: spark Updated Branches: refs/heads/branch-2.2 d8bd213f1 -> 5fe9313d7 [SPARK-20544][SPARKR] skip tests when running on CRAN General rule on skip or not: skip if - RDD tests - tests could run long or complicated (streaming, hivecontext) - tests on error conditions - tests won't likely change/break unit tests, `R CMD check --as-cran`, `R CMD check` Author: Felix CheungCloses #17817 from felixcheung/rskiptest. (cherry picked from commit fc472bddd1d9c6a28e57e31496c0166777af597e) Signed-off-by: Felix Cheung Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5fe9313d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5fe9313d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5fe9313d Branch: refs/heads/branch-2.2 Commit: 5fe9313d7c81679981000b8aea5ea4668a0a0bc8 Parents: d8bd213 Author: Felix Cheung Authored: Wed May 3 21:40:18 2017 -0700 Committer: Felix Cheung Committed: Wed May 3 21:51:33 2017 -0700 -- R/pkg/inst/tests/testthat/test_Serde.R | 6 ++ R/pkg/inst/tests/testthat/test_Windows.R| 2 + R/pkg/inst/tests/testthat/test_binaryFile.R | 8 ++ .../inst/tests/testthat/test_binary_function.R | 6 ++ R/pkg/inst/tests/testthat/test_broadcast.R | 4 + R/pkg/inst/tests/testthat/test_client.R | 8 ++ R/pkg/inst/tests/testthat/test_context.R| 16 +++ R/pkg/inst/tests/testthat/test_includePackage.R | 4 + .../inst/tests/testthat/test_mllib_clustering.R | 4 + .../inst/tests/testthat/test_mllib_regression.R | 12 +++ .../tests/testthat/test_parallelize_collect.R | 8 ++ R/pkg/inst/tests/testthat/test_rdd.R| 106 ++- R/pkg/inst/tests/testthat/test_shuffle.R| 24 + R/pkg/inst/tests/testthat/test_sparkR.R | 2 + R/pkg/inst/tests/testthat/test_sparkSQL.R | 60 +++ R/pkg/inst/tests/testthat/test_streaming.R | 12 +++ R/pkg/inst/tests/testthat/test_take.R | 2 + R/pkg/inst/tests/testthat/test_textFile.R | 18 R/pkg/inst/tests/testthat/test_utils.R | 5 + R/run-tests.sh | 2 +- 20 files changed, 306 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5fe9313d/R/pkg/inst/tests/testthat/test_Serde.R -- diff --git a/R/pkg/inst/tests/testthat/test_Serde.R b/R/pkg/inst/tests/testthat/test_Serde.R index b5f6f1b..518fb7b 100644 --- a/R/pkg/inst/tests/testthat/test_Serde.R +++ b/R/pkg/inst/tests/testthat/test_Serde.R @@ -20,6 +20,8 @@ context("SerDe functionality") sparkSession <- sparkR.session(enableHiveSupport = FALSE) test_that("SerDe of primitive types", { + skip_on_cran() + x <- callJStatic("SparkRHandler", "echo", 1L) expect_equal(x, 1L) expect_equal(class(x), "integer") @@ -38,6 +40,8 @@ test_that("SerDe of primitive types", { }) test_that("SerDe of list of primitive types", { + skip_on_cran() + x <- list(1L, 2L, 3L) y <- callJStatic("SparkRHandler", "echo", x) expect_equal(x, y) @@ -65,6 +69,8 @@ test_that("SerDe of list of primitive types", { }) test_that("SerDe of list of lists", { + skip_on_cran() + x <- list(list(1L, 2L, 3L), list(1, 2, 3), list(TRUE, FALSE), list("a", "b", "c")) y <- callJStatic("SparkRHandler", "echo", x) http://git-wip-us.apache.org/repos/asf/spark/blob/5fe9313d/R/pkg/inst/tests/testthat/test_Windows.R -- diff --git a/R/pkg/inst/tests/testthat/test_Windows.R b/R/pkg/inst/tests/testthat/test_Windows.R index 1d777dd..919b063 100644 --- a/R/pkg/inst/tests/testthat/test_Windows.R +++ b/R/pkg/inst/tests/testthat/test_Windows.R @@ -17,6 +17,8 @@ context("Windows-specific tests") test_that("sparkJars tag in SparkContext", { + skip_on_cran() + if (.Platform$OS.type != "windows") { skip("This 
test is only for Windows, skipped") } http://git-wip-us.apache.org/repos/asf/spark/blob/5fe9313d/R/pkg/inst/tests/testthat/test_binaryFile.R -- diff --git a/R/pkg/inst/tests/testthat/test_binaryFile.R b/R/pkg/inst/tests/testthat/test_binaryFile.R index b5c279e..63f54e1 100644 --- a/R/pkg/inst/tests/testthat/test_binaryFile.R +++ b/R/pkg/inst/tests/testthat/test_binaryFile.R @@ -24,6 +24,8 @@ sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", "getJavaSparkContext", mockFile <- c("Spark is pretty.", "Spark is awesome.") test_that("saveAsObjectFile()/objectFile() following textFile() works", { + skip_on_cran() + fileName1 <- tempfile(pattern = "spark-test",
spark git commit: [SPARK-20543][SPARKR] skip tests when running on CRAN
Repository: spark Updated Branches: refs/heads/master 02bbe7311 -> fc472bddd [SPARK-20543][SPARKR] skip tests when running on CRAN ## What changes were proposed in this pull request? General rule on skip or not: skip if - RDD tests - tests could run long or complicated (streaming, hivecontext) - tests on error conditions - tests won't likely change/break ## How was this patch tested? unit tests, `R CMD check --as-cran`, `R CMD check` Author: Felix CheungCloses #17817 from felixcheung/rskiptest. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fc472bdd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fc472bdd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fc472bdd Branch: refs/heads/master Commit: fc472bddd1d9c6a28e57e31496c0166777af597e Parents: 02bbe73 Author: Felix Cheung Authored: Wed May 3 21:40:18 2017 -0700 Committer: Felix Cheung Committed: Wed May 3 21:40:18 2017 -0700 -- R/pkg/inst/tests/testthat/test_Serde.R | 6 ++ R/pkg/inst/tests/testthat/test_Windows.R| 2 + R/pkg/inst/tests/testthat/test_binaryFile.R | 8 ++ .../inst/tests/testthat/test_binary_function.R | 6 ++ R/pkg/inst/tests/testthat/test_broadcast.R | 4 + R/pkg/inst/tests/testthat/test_client.R | 8 ++ R/pkg/inst/tests/testthat/test_context.R| 16 +++ R/pkg/inst/tests/testthat/test_includePackage.R | 4 + .../inst/tests/testthat/test_mllib_clustering.R | 4 + .../inst/tests/testthat/test_mllib_regression.R | 12 +++ .../tests/testthat/test_parallelize_collect.R | 8 ++ R/pkg/inst/tests/testthat/test_rdd.R| 106 ++- R/pkg/inst/tests/testthat/test_shuffle.R| 24 + R/pkg/inst/tests/testthat/test_sparkR.R | 2 + R/pkg/inst/tests/testthat/test_sparkSQL.R | 61 ++- R/pkg/inst/tests/testthat/test_streaming.R | 12 +++ R/pkg/inst/tests/testthat/test_take.R | 2 + R/pkg/inst/tests/testthat/test_textFile.R | 18 R/pkg/inst/tests/testthat/test_utils.R | 6 ++ R/run-tests.sh | 2 +- 20 files changed, 307 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fc472bdd/R/pkg/inst/tests/testthat/test_Serde.R -- diff --git a/R/pkg/inst/tests/testthat/test_Serde.R b/R/pkg/inst/tests/testthat/test_Serde.R index b5f6f1b..518fb7b 100644 --- a/R/pkg/inst/tests/testthat/test_Serde.R +++ b/R/pkg/inst/tests/testthat/test_Serde.R @@ -20,6 +20,8 @@ context("SerDe functionality") sparkSession <- sparkR.session(enableHiveSupport = FALSE) test_that("SerDe of primitive types", { + skip_on_cran() + x <- callJStatic("SparkRHandler", "echo", 1L) expect_equal(x, 1L) expect_equal(class(x), "integer") @@ -38,6 +40,8 @@ test_that("SerDe of primitive types", { }) test_that("SerDe of list of primitive types", { + skip_on_cran() + x <- list(1L, 2L, 3L) y <- callJStatic("SparkRHandler", "echo", x) expect_equal(x, y) @@ -65,6 +69,8 @@ test_that("SerDe of list of primitive types", { }) test_that("SerDe of list of lists", { + skip_on_cran() + x <- list(list(1L, 2L, 3L), list(1, 2, 3), list(TRUE, FALSE), list("a", "b", "c")) y <- callJStatic("SparkRHandler", "echo", x) http://git-wip-us.apache.org/repos/asf/spark/blob/fc472bdd/R/pkg/inst/tests/testthat/test_Windows.R -- diff --git a/R/pkg/inst/tests/testthat/test_Windows.R b/R/pkg/inst/tests/testthat/test_Windows.R index 1d777dd..919b063 100644 --- a/R/pkg/inst/tests/testthat/test_Windows.R +++ b/R/pkg/inst/tests/testthat/test_Windows.R @@ -17,6 +17,8 @@ context("Windows-specific tests") test_that("sparkJars tag in SparkContext", { + skip_on_cran() + if (.Platform$OS.type != "windows") { skip("This test is only for 
Windows, skipped") } http://git-wip-us.apache.org/repos/asf/spark/blob/fc472bdd/R/pkg/inst/tests/testthat/test_binaryFile.R -- diff --git a/R/pkg/inst/tests/testthat/test_binaryFile.R b/R/pkg/inst/tests/testthat/test_binaryFile.R index b5c279e..63f54e1 100644 --- a/R/pkg/inst/tests/testthat/test_binaryFile.R +++ b/R/pkg/inst/tests/testthat/test_binaryFile.R @@ -24,6 +24,8 @@ sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", "getJavaSparkContext", mockFile <- c("Spark is pretty.", "Spark is awesome.") test_that("saveAsObjectFile()/objectFile() following textFile() works", { + skip_on_cran() + fileName1 <- tempfile(pattern = "spark-test", fileext = ".tmp") fileName2 <-
spark git commit: [SPARK-20584][PYSPARK][SQL] Python generic hint support
Repository: spark Updated Branches: refs/heads/master 13eb37c86 -> 02bbe7311 [SPARK-20584][PYSPARK][SQL] Python generic hint support ## What changes were proposed in this pull request? Adds `hint` method to PySpark `DataFrame`. ## How was this patch tested? Unit tests, doctests. Author: zero323Closes #17850 from zero323/SPARK-20584. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/02bbe731 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/02bbe731 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/02bbe731 Branch: refs/heads/master Commit: 02bbe73118a39e2fb378aa2002449367a92f6d67 Parents: 13eb37c Author: zero323 Authored: Wed May 3 19:15:28 2017 -0700 Committer: Reynold Xin Committed: Wed May 3 19:15:28 2017 -0700 -- python/pyspark/sql/dataframe.py | 29 + python/pyspark/sql/tests.py | 16 2 files changed, 45 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/02bbe731/python/pyspark/sql/dataframe.py -- diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py index ab6d35b..7b67985 100644 --- a/python/pyspark/sql/dataframe.py +++ b/python/pyspark/sql/dataframe.py @@ -380,6 +380,35 @@ class DataFrame(object): jdf = self._jdf.withWatermark(eventTime, delayThreshold) return DataFrame(jdf, self.sql_ctx) +@since(2.2) +def hint(self, name, *parameters): +"""Specifies some hint on the current DataFrame. + +:param name: A name of the hint. +:param parameters: Optional parameters. +:return: :class:`DataFrame` + +>>> df.join(df2.hint("broadcast"), "name").show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5|85| +++---+--+ +""" +if len(parameters) == 1 and isinstance(parameters[0], list): +parameters = parameters[0] + +if not isinstance(name, str): +raise TypeError("name should be provided as str, got {0}".format(type(name))) + +for p in parameters: +if not isinstance(p, str): +raise TypeError( +"all parameters should be str, got {0} of type {1}".format(p, type(p))) + +jdf = self._jdf.hint(name, self._jseq(parameters)) +return DataFrame(jdf, self.sql_ctx) + @since(1.3) def count(self): """Returns the number of rows in this :class:`DataFrame`. http://git-wip-us.apache.org/repos/asf/spark/blob/02bbe731/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index ce4abf8..f644624 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -1906,6 +1906,22 @@ class SQLTests(ReusedPySparkTestCase): # planner should not crash without a join broadcast(df1)._jdf.queryExecution().executedPlan() +def test_generic_hints(self): +from pyspark.sql import DataFrame + +df1 = self.spark.range(10e10).toDF("id") +df2 = self.spark.range(10e10).toDF("id") + +self.assertIsInstance(df1.hint("broadcast"), DataFrame) +self.assertIsInstance(df1.hint("broadcast", []), DataFrame) + +# Dummy rules +self.assertIsInstance(df1.hint("broadcast", "foo", "bar"), DataFrame) +self.assertIsInstance(df1.hint("broadcast", ["foo", "bar"]), DataFrame) + +plan = df1.join(df2.hint("broadcast"), "id")._jdf.queryExecution().executedPlan() +self.assertEqual(1, plan.toString().count("BroadcastHashJoin")) + def test_toDF_with_schema_string(self): data = [Row(key=i, value=str(i)) for i in range(100)] rdd = self.sc.parallelize(data, 5) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-20584][PYSPARK][SQL] Python generic hint support
Repository: spark Updated Branches: refs/heads/branch-2.2 a3a5fcfef -> d8bd213f1 [SPARK-20584][PYSPARK][SQL] Python generic hint support ## What changes were proposed in this pull request? Adds `hint` method to PySpark `DataFrame`. ## How was this patch tested? Unit tests, doctests. Author: zero323Closes #17850 from zero323/SPARK-20584. (cherry picked from commit 02bbe73118a39e2fb378aa2002449367a92f6d67) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d8bd213f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d8bd213f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d8bd213f Branch: refs/heads/branch-2.2 Commit: d8bd213f13279664d50ffa57c1814d0b16fc5d23 Parents: a3a5fcf Author: zero323 Authored: Wed May 3 19:15:28 2017 -0700 Committer: Reynold Xin Committed: Wed May 3 19:15:42 2017 -0700 -- python/pyspark/sql/dataframe.py | 29 + python/pyspark/sql/tests.py | 16 2 files changed, 45 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d8bd213f/python/pyspark/sql/dataframe.py -- diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py index f567cc4..d62ba96 100644 --- a/python/pyspark/sql/dataframe.py +++ b/python/pyspark/sql/dataframe.py @@ -371,6 +371,35 @@ class DataFrame(object): jdf = self._jdf.withWatermark(eventTime, delayThreshold) return DataFrame(jdf, self.sql_ctx) +@since(2.2) +def hint(self, name, *parameters): +"""Specifies some hint on the current DataFrame. + +:param name: A name of the hint. +:param parameters: Optional parameters. +:return: :class:`DataFrame` + +>>> df.join(df2.hint("broadcast"), "name").show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5|85| +++---+--+ +""" +if len(parameters) == 1 and isinstance(parameters[0], list): +parameters = parameters[0] + +if not isinstance(name, str): +raise TypeError("name should be provided as str, got {0}".format(type(name))) + +for p in parameters: +if not isinstance(p, str): +raise TypeError( +"all parameters should be str, got {0} of type {1}".format(p, type(p))) + +jdf = self._jdf.hint(name, self._jseq(parameters)) +return DataFrame(jdf, self.sql_ctx) + @since(1.3) def count(self): """Returns the number of rows in this :class:`DataFrame`. http://git-wip-us.apache.org/repos/asf/spark/blob/d8bd213f/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index cd92148..2aa2d23 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -1906,6 +1906,22 @@ class SQLTests(ReusedPySparkTestCase): # planner should not crash without a join broadcast(df1)._jdf.queryExecution().executedPlan() +def test_generic_hints(self): +from pyspark.sql import DataFrame + +df1 = self.spark.range(10e10).toDF("id") +df2 = self.spark.range(10e10).toDF("id") + +self.assertIsInstance(df1.hint("broadcast"), DataFrame) +self.assertIsInstance(df1.hint("broadcast", []), DataFrame) + +# Dummy rules +self.assertIsInstance(df1.hint("broadcast", "foo", "bar"), DataFrame) +self.assertIsInstance(df1.hint("broadcast", ["foo", "bar"]), DataFrame) + +plan = df1.join(df2.hint("broadcast"), "id")._jdf.queryExecution().executedPlan() +self.assertEqual(1, plan.toString().count("BroadcastHashJoin")) + def test_toDF_with_schema_string(self): data = [Row(key=i, value=str(i)) for i in range(100)] rdd = self.sc.parallelize(data, 5) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[2/2] spark git commit: Preparing development version 2.2.1-SNAPSHOT
Preparing development version 2.2.1-SNAPSHOT Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a3a5fcfe Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a3a5fcfe Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a3a5fcfe Branch: refs/heads/branch-2.2 Commit: a3a5fcfefcc25e03496d097b63cd268f61d24c09 Parents: 1d4017b Author: Patrick WendellAuthored: Wed May 3 16:50:12 2017 -0700 Committer: Patrick Wendell Committed: Wed May 3 16:50:12 2017 -0700 -- R/pkg/DESCRIPTION | 2 +- assembly/pom.xml | 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 4 ++-- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/mesos/pom.xml | 2 +- resource-managers/yarn/pom.xml| 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 38 files changed, 39 insertions(+), 39 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a3a5fcfe/R/pkg/DESCRIPTION -- diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION index 879c1f8..cfa49b9 100644 --- a/R/pkg/DESCRIPTION +++ b/R/pkg/DESCRIPTION @@ -1,6 +1,6 @@ Package: SparkR Type: Package -Version: 2.2.0 +Version: 2.2.1 Title: R Frontend for Apache Spark Description: The SparkR package provides an R Frontend for Apache Spark. 
Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"), http://git-wip-us.apache.org/repos/asf/spark/blob/a3a5fcfe/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 3a7003f..da7b0c9 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.2.0 +2.2.1-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/a3a5fcfe/common/network-common/pom.xml -- diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index 5e9ffd1..7577253 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.2.0 +2.2.1-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/a3a5fcfe/common/network-shuffle/pom.xml -- diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index c3e10d1..558864a 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.2.0 +2.2.1-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/a3a5fcfe/common/network-yarn/pom.xml -- diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index 10ea657..70fed65 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.2.0 +2.2.1-SNAPSHOT ../../pom.xml
[1/2] spark git commit: Preparing Spark release v2.2.0-rc2
Repository: spark Updated Branches: refs/heads/branch-2.2 2629e7c7a -> a3a5fcfef Preparing Spark release v2.2.0-rc2 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1d4017b4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1d4017b4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1d4017b4 Branch: refs/heads/branch-2.2 Commit: 1d4017b44d5e6ad156abeaae6371747f111dd1f9 Parents: 2629e7c Author: Patrick WendellAuthored: Wed May 3 16:50:08 2017 -0700 Committer: Patrick Wendell Committed: Wed May 3 16:50:08 2017 -0700 -- assembly/pom.xml | 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 2 +- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/mesos/pom.xml | 2 +- resource-managers/yarn/pom.xml| 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 37 files changed, 37 insertions(+), 37 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1d4017b4/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 9d8607d..3a7003f 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.2.0-SNAPSHOT +2.2.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/1d4017b4/common/network-common/pom.xml -- diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index 8657af7..5e9ffd1 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.2.0-SNAPSHOT +2.2.0 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/1d4017b4/common/network-shuffle/pom.xml -- diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index 24c10fb..c3e10d1 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.2.0-SNAPSHOT +2.2.0 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/1d4017b4/common/network-yarn/pom.xml -- diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index 5e5a80b..10ea657 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.2.0-SNAPSHOT +2.2.0 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/1d4017b4/common/sketch/pom.xml -- diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index 1356c47..1a1f652 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.2.0-SNAPSHOT +2.2.0 ../../pom.xml 
http://git-wip-us.apache.org/repos/asf/spark/blob/1d4017b4/common/tags/pom.xml
spark git commit: [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!=
Repository: spark Updated Branches: refs/heads/master 6b9e49d12 -> 13eb37c86 [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!= ## What changes were proposed in this pull request? This PR proposes three things as below: - This test looks not testing `<=>` and identical with the test above, `===`. So, it removes the test. ```diff - test("<=>") { - checkAnswer( - testData2.filter($"a" === 1), - testData2.collect().toSeq.filter(r => r.getInt(0) == 1)) - -checkAnswer( - testData2.filter($"a" === $"b"), - testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1))) - } ``` - Replace the test title from `=!=` to `<=>`. It looks the test actually testing `<=>`. ```diff + private lazy val nullData = Seq( +(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, None)).toDF("a", "b") + ... - test("=!=") { + test("<=>") { -val nullData = spark.createDataFrame(sparkContext.parallelize( - Row(1, 1) :: - Row(1, 2) :: - Row(1, null) :: - Row(null, null) :: Nil), - StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType - checkAnswer( nullData.filter($"b" <=> 1), ... ``` - Add the tests for `=!=` which looks not existing. ```diff + test("=!=") { +checkAnswer( + nullData.filter($"b" =!= 1), + Row(1, 2) :: Nil) + +checkAnswer(nullData.filter($"b" =!= null), Nil) + +checkAnswer( + nullData.filter($"a" =!= $"b"), + Row(1, 2) :: Nil) + } ``` ## How was this patch tested? Manually running the tests. Author: hyukjinkwonCloses #17842 from HyukjinKwon/minor-test-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/13eb37c8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/13eb37c8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/13eb37c8 Branch: refs/heads/master Commit: 13eb37c860c8f672d0e9d9065d0333f981db71e3 Parents: 6b9e49d Author: hyukjinkwon Authored: Wed May 3 13:08:25 2017 -0700 Committer: Reynold Xin Committed: Wed May 3 13:08:25 2017 -0700 -- .../spark/sql/ColumnExpressionSuite.scala | 31 +--- 1 file changed, 14 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/13eb37c8/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala index b0f398d..bc708ca 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala @@ -39,6 +39,9 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext { StructType(Seq(StructField("a", BooleanType), StructField("b", BooleanType } + private lazy val nullData = Seq( +(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, None)).toDF("a", "b") + test("column names with space") { val df = Seq((1, "a")).toDF("name with space", "name.with.dot") @@ -284,23 +287,6 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext { test("<=>") { checkAnswer( - testData2.filter($"a" === 1), - testData2.collect().toSeq.filter(r => r.getInt(0) == 1)) - -checkAnswer( - testData2.filter($"a" === $"b"), - testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1))) - } - - test("=!=") { -val nullData = spark.createDataFrame(sparkContext.parallelize( - Row(1, 1) :: - Row(1, 2) :: - Row(1, null) :: - Row(null, null) :: Nil), - StructType(Seq(StructField("a", IntegerType), StructField("b", 
IntegerType - -checkAnswer( nullData.filter($"b" <=> 1), Row(1, 1) :: Nil) @@ -321,7 +307,18 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext { checkAnswer( nullData2.filter($"a" <=> null), Row(null) :: Nil) + } + test("=!=") { +checkAnswer( + nullData.filter($"b" =!= 1), + Row(1, 2) :: Nil) + +checkAnswer(nullData.filter($"b" =!= null), Nil) + +checkAnswer( + nullData.filter($"a" =!= $"b"), + Row(1, 2) :: Nil) } test(">") { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!=
Repository: spark Updated Branches: refs/heads/branch-2.2 36d807906 -> 2629e7c7a [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!= ## What changes were proposed in this pull request? This PR proposes three things as below: - This test looks not testing `<=>` and identical with the test above, `===`. So, it removes the test. ```diff - test("<=>") { - checkAnswer( - testData2.filter($"a" === 1), - testData2.collect().toSeq.filter(r => r.getInt(0) == 1)) - -checkAnswer( - testData2.filter($"a" === $"b"), - testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1))) - } ``` - Replace the test title from `=!=` to `<=>`. It looks the test actually testing `<=>`. ```diff + private lazy val nullData = Seq( +(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, None)).toDF("a", "b") + ... - test("=!=") { + test("<=>") { -val nullData = spark.createDataFrame(sparkContext.parallelize( - Row(1, 1) :: - Row(1, 2) :: - Row(1, null) :: - Row(null, null) :: Nil), - StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType - checkAnswer( nullData.filter($"b" <=> 1), ... ``` - Add the tests for `=!=` which looks not existing. ```diff + test("=!=") { +checkAnswer( + nullData.filter($"b" =!= 1), + Row(1, 2) :: Nil) + +checkAnswer(nullData.filter($"b" =!= null), Nil) + +checkAnswer( + nullData.filter($"a" =!= $"b"), + Row(1, 2) :: Nil) + } ``` ## How was this patch tested? Manually running the tests. Author: hyukjinkwonCloses #17842 from HyukjinKwon/minor-test-fix. (cherry picked from commit 13eb37c860c8f672d0e9d9065d0333f981db71e3) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2629e7c7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2629e7c7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2629e7c7 Branch: refs/heads/branch-2.2 Commit: 2629e7c7a1dacfb267d866cf825fa8a078612462 Parents: 36d8079 Author: hyukjinkwon Authored: Wed May 3 13:08:25 2017 -0700 Committer: Reynold Xin Committed: Wed May 3 13:08:31 2017 -0700 -- .../spark/sql/ColumnExpressionSuite.scala | 31 +--- 1 file changed, 14 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2629e7c7/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala index b0f398d..bc708ca 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala @@ -39,6 +39,9 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext { StructType(Seq(StructField("a", BooleanType), StructField("b", BooleanType } + private lazy val nullData = Seq( +(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, None)).toDF("a", "b") + test("column names with space") { val df = Seq((1, "a")).toDF("name with space", "name.with.dot") @@ -284,23 +287,6 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext { test("<=>") { checkAnswer( - testData2.filter($"a" === 1), - testData2.collect().toSeq.filter(r => r.getInt(0) == 1)) - -checkAnswer( - testData2.filter($"a" === $"b"), - testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1))) - } - - test("=!=") { -val nullData = spark.createDataFrame(sparkContext.parallelize( - Row(1, 1) :: - Row(1, 2) :: - 
Row(1, null) :: - Row(null, null) :: Nil), - StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType - -checkAnswer( nullData.filter($"b" <=> 1), Row(1, 1) :: Nil) @@ -321,7 +307,18 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext { checkAnswer( nullData2.filter($"a" <=> null), Row(null) :: Nil) + } + test("=!=") { +checkAnswer( + nullData.filter($"b" =!= 1), + Row(1, 2) :: Nil) + +checkAnswer(nullData.filter($"b" =!= null), Nil) + +checkAnswer( + nullData.filter($"a" =!= $"b"), + Row(1, 2) :: Nil) } test(">") { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For
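For context on the operators the two renamed tests above exercise, here is a minimal standalone sketch (not part of the patch) of how `<=>` (null-safe equality) and `=!=` (inequality) behave on nullable columns. It assumes a local `SparkSession` named `spark`; the data mirrors the suite's `nullData` fixture.

```scala
import org.apache.spark.sql.SparkSession

object NullSafeEqualityExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("null-safe-eq").getOrCreate()
    import spark.implicits._

    // Same shape as the suite's nullData fixture.
    val nullData = Seq(
      (Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, None)).toDF("a", "b")

    // <=> is null-safe: null <=> null is true and null <=> 1 is false (never null).
    nullData.filter($"b" <=> 1).show()     // keeps (1, 1) only
    nullData.filter($"a" <=> $"b").show()  // keeps (1, 1) and (null, null)

    // =!= follows ordinary SQL three-valued logic: comparing against null yields null,
    // so rows with a null operand are filtered out.
    nullData.filter($"b" =!= 1).show()     // keeps (1, 2) only
    nullData.filter($"a" =!= $"b").show()  // keeps (1, 2) only

    spark.stop()
  }
}
```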
spark git commit: [SPARK-19965][SS] DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output
Repository: spark Updated Branches: refs/heads/master 527fc5d0c -> 6b9e49d12 [SPARK-19965][SS] DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output ## The Problem Right now DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output: ``` [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** (3 seconds, 928 milliseconds) [info] java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths: [info] ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637 [info] ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata [info] [info] If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them. [info] at scala.Predef$.assert(Predef.scala:170) [info] at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133) [info] at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98) [info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156) [info] at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54) [info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55) [info] at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133) [info] at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361) [info] at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160) [info] at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536) [info] at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520) [info] at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292) [info] at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) [info] at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) ``` ## What changes were proposed in this pull request? This patch alters `InMemoryFileIndex` to filter out these `basePath`s whose ancestor is the streaming metadata dir (`_spark_metadata`). E.g., the following and other similar dir or files will be filtered out: - (introduced by globbing `basePath/*`) - `basePath/_spark_metadata` - (introduced by globbing `basePath/*/*`) - `basePath/_spark_metadata/0` - `basePath/_spark_metadata/1` - ... ## How was this patch tested? Added unit tests Author: Liwei LinCloses #17346 from lw-lin/filter-metadata. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6b9e49d1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6b9e49d1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6b9e49d1 Branch: refs/heads/master Commit: 6b9e49d12fc4c9b29d497122daa4cc9bf4540b16 Parents: 527fc5d Author: Liwei Lin Authored: Wed May 3 11:10:24 2017 -0700 Committer: Shixiong Zhu Committed: Wed May 3 11:10:24 2017 -0700 -- .../datasources/InMemoryFileIndex.scala | 13 - .../execution/streaming/FileStreamSink.scala| 20 +++ .../datasources/FileSourceStrategySuite.scala | 2 +- .../sql/streaming/FileStreamSinkSuite.scala | 59 +++- 4 files changed, 90 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6b9e49d1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala index 9897ab7..91e3165 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala @@ -27,6 +27,7 @@ import org.apache.hadoop.mapred.{FileInputFormat, JobConf} import org.apache.spark.internal.Logging import org.apache.spark.metrics.source.HiveCatalogMetrics +import org.apache.spark.sql.execution.streaming.FileStreamSink import org.apache.spark.sql.SparkSession import org.apache.spark.sql.types.StructType import
spark git commit: [SPARK-19965][SS] DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output
Repository: spark Updated Branches: refs/heads/branch-2.2 f0e80aa2d -> 36d807906 [SPARK-19965][SS] DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output ## The Problem Right now DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output: ``` [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** (3 seconds, 928 milliseconds) [info] java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths: [info] ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637 [info] ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata [info] [info] If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them. [info] at scala.Predef$.assert(Predef.scala:170) [info] at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133) [info] at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98) [info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156) [info] at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54) [info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55) [info] at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133) [info] at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361) [info] at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160) [info] at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536) [info] at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520) [info] at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292) [info] at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) [info] at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) ``` ## What changes were proposed in this pull request? This patch alters `InMemoryFileIndex` to filter out these `basePath`s whose ancestor is the streaming metadata dir (`_spark_metadata`). E.g., the following and other similar dir or files will be filtered out: - (introduced by globbing `basePath/*`) - `basePath/_spark_metadata` - (introduced by globbing `basePath/*/*`) - `basePath/_spark_metadata/0` - `basePath/_spark_metadata/1` - ... ## How was this patch tested? Added unit tests Author: Liwei LinCloses #17346 from lw-lin/filter-metadata. 
(cherry picked from commit 6b9e49d12fc4c9b29d497122daa4cc9bf4540b16) Signed-off-by: Shixiong Zhu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/36d80790 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/36d80790 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/36d80790 Branch: refs/heads/branch-2.2 Commit: 36d80790699c529b15e9c1a2cf2f9f636b1f24e6 Parents: f0e80aa Author: Liwei Lin Authored: Wed May 3 11:10:24 2017 -0700 Committer: Shixiong Zhu Committed: Wed May 3 11:10:31 2017 -0700 -- .../datasources/InMemoryFileIndex.scala | 13 - .../execution/streaming/FileStreamSink.scala| 20 +++ .../datasources/FileSourceStrategySuite.scala | 2 +- .../sql/streaming/FileStreamSinkSuite.scala | 59 +++- 4 files changed, 90 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/36d80790/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala index 9897ab7..91e3165 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala @@ -27,6 +27,7 @@ import org.apache.hadoop.mapred.{FileInputFormat, JobConf} import org.apache.spark.internal.Logging import org.apache.spark.metrics.source.HiveCatalogMetrics +import
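A hedged, standalone sketch of the scenario the two SPARK-19965 commits above address (scratch paths and a local `SparkSession` are assumptions of the sketch, not code from the patch): a streaming query writes partitioned parquet through `FileStreamSink`, which also creates a `_spark_metadata` directory under the output path, and a later batch read with `basePath` globs under that path.

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

object ReadBackFileStreamSinkOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("read-back-sink").getOrCreate()
    import spark.implicits._

    // Scratch locations for the sketch.
    val inDir = Files.createTempDirectory("stream.input").toString
    val outDir = Files.createTempDirectory("stream.output").toString
    val checkpointDir = Files.createTempDirectory("stream.checkpoint").toString
    Files.write(Paths.get(inDir, "batch-0.txt"), "1\n2\n3\n".getBytes("UTF-8"))

    // Stream text in and write partitioned parquet out. Besides the partition
    // directories (even=0, even=1), FileStreamSink creates outDir/_spark_metadata.
    val query = spark.readStream.text(inDir)
      .select($"value", ($"value".cast("int") % 2).as("even"))
      .writeStream
      .format("parquet")
      .partitionBy("even")
      .option("checkpointLocation", checkpointDir)
      .start(outDir)
    query.processAllAvailable()
    query.stop()

    // Batch read with a glob under the output dir. Before the patch the glob could pull in
    // _spark_metadata and partition inference failed with "Conflicting directory structures
    // detected"; with the patch those paths are filtered out of the file index.
    spark.read.option("basePath", outDir).parquet(outDir + "/*").show()

    spark.stop()
  }
}
```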
spark-website git commit: trigger resync
Repository: spark-website Updated Branches: refs/heads/asf-site d4f0c34ac -> 7b32b181f trigger resync Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/7b32b181 Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/7b32b181 Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/7b32b181 Branch: refs/heads/asf-site Commit: 7b32b181fd554755eab39658c79e56b3ad1b4334 Parents: d4f0c34 Author: Michael ArmbrustAuthored: Wed May 3 10:27:05 2017 -0700 Committer: Michael Armbrust Committed: Wed May 3 10:27:05 2017 -0700 -- news/_posts/2017-05-02-spark-2-1-1-released.md | 1 + 1 file changed, 1 insertion(+) -- http://git-wip-us.apache.org/repos/asf/spark-website/blob/7b32b181/news/_posts/2017-05-02-spark-2-1-1-released.md -- diff --git a/news/_posts/2017-05-02-spark-2-1-1-released.md b/news/_posts/2017-05-02-spark-2-1-1-released.md index fe72279..3dd12f2 100644 --- a/news/_posts/2017-05-02-spark-2-1-1-released.md +++ b/news/_posts/2017-05-02-spark-2-1-1-released.md @@ -12,3 +12,4 @@ meta: _wpas_done_all: '1' --- We are happy to announce the availability of Apache Spark 2.1.1! Visit the release notes to read about the changes, or download the release today. + - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark-website] Git Push Summary
Repository: spark-website Updated Branches: refs/heads/spark-2.1.1 [created] d4f0c34ac
spark git commit: [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame
Repository: spark Updated Branches: refs/heads/branch-2.2 b1a732fea -> f0e80aa2d [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame ## What changes were proposed in this pull request? We allow users to specify hints (currently only "broadcast" is supported) in SQL and DataFrame. However, while SQL has a standard hint format (/*+ ... */), DataFrame doesn't have one and sometimes users are confused that they can't find how to apply a broadcast hint. This ticket adds a generic hint function on DataFrame that allows using the same hint on DataFrames as well as SQL. As an example, after this patch, the following will apply a broadcast hint on a DataFrame using the new hint function: ``` df1.join(df2.hint("broadcast")) ``` ## How was this patch tested? Added a test case in DataFrameJoinSuite. Author: Reynold XinCloses #17839 from rxin/SPARK-20576. (cherry picked from commit 527fc5d0c990daaacad4740f62cfe6736609b77b) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f0e80aa2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f0e80aa2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f0e80aa2 Branch: refs/heads/branch-2.2 Commit: f0e80aa2ddee80819ef33ee24eb6a15a73bc02d5 Parents: b1a732f Author: Reynold Xin Authored: Wed May 3 09:22:25 2017 -0700 Committer: Reynold Xin Committed: Wed May 3 09:22:41 2017 -0700 -- .../sql/catalyst/analysis/ResolveHints.scala | 8 +++- .../main/scala/org/apache/spark/sql/Dataset.scala | 16 .../org/apache/spark/sql/DataFrameJoinSuite.scala | 18 +- 3 files changed, 40 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f0e80aa2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala index c4827b8..df688fa 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala @@ -86,7 +86,13 @@ object ResolveHints { def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { case h: Hint if BROADCAST_HINT_NAMES.contains(h.name.toUpperCase(Locale.ROOT)) => -applyBroadcastHint(h.child, h.parameters.toSet) +if (h.parameters.isEmpty) { + // If there is no table alias specified, turn the entire subtree into a BroadcastHint. + BroadcastHint(h.child) +} else { + // Otherwise, find within the subtree query plans that should be broadcasted. + applyBroadcastHint(h.child, h.parameters.toSet) +} } } http://git-wip-us.apache.org/repos/asf/spark/blob/f0e80aa2/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 06dd550..5f602dc 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -1074,6 +1074,22 @@ class Dataset[T] private[sql]( def apply(colName: String): Column = col(colName) /** + * Specifies some hint on the current Dataset. 
As an example, the following code specifies + * that one of the plan can be broadcasted: + * + * {{{ + * df1.join(df2.hint("broadcast")) + * }}} + * + * @group basic + * @since 2.2.0 + */ + @scala.annotation.varargs + def hint(name: String, parameters: String*): Dataset[T] = withTypedPlan { +Hint(name, parameters, logicalPlan) + } + + /** * Selects column based on the column name and return it as a [[Column]]. * * @note The column name can also reference to a nested column like `a.b`. http://git-wip-us.apache.org/repos/asf/spark/blob/f0e80aa2/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala index 541ffb5..4a52af6 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala @@ -151,7 +151,7 @@ class DataFrameJoinSuite extends QueryTest with
spark git commit: [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame
Repository: spark Updated Branches: refs/heads/master 27f543b15 -> 527fc5d0c [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame ## What changes were proposed in this pull request? We allow users to specify hints (currently only "broadcast" is supported) in SQL and DataFrame. However, while SQL has a standard hint format (/*+ ... */), DataFrame doesn't have one and sometimes users are confused that they can't find how to apply a broadcast hint. This ticket adds a generic hint function on DataFrame that allows using the same hint on DataFrames as well as SQL. As an example, after this patch, the following will apply a broadcast hint on a DataFrame using the new hint function: ``` df1.join(df2.hint("broadcast")) ``` ## How was this patch tested? Added a test case in DataFrameJoinSuite. Author: Reynold XinCloses #17839 from rxin/SPARK-20576. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/527fc5d0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/527fc5d0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/527fc5d0 Branch: refs/heads/master Commit: 527fc5d0c990daaacad4740f62cfe6736609b77b Parents: 27f543b Author: Reynold Xin Authored: Wed May 3 09:22:25 2017 -0700 Committer: Reynold Xin Committed: Wed May 3 09:22:25 2017 -0700 -- .../sql/catalyst/analysis/ResolveHints.scala | 8 +++- .../main/scala/org/apache/spark/sql/Dataset.scala | 16 .../org/apache/spark/sql/DataFrameJoinSuite.scala | 18 +- 3 files changed, 40 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/527fc5d0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala index c4827b8..df688fa 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala @@ -86,7 +86,13 @@ object ResolveHints { def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { case h: Hint if BROADCAST_HINT_NAMES.contains(h.name.toUpperCase(Locale.ROOT)) => -applyBroadcastHint(h.child, h.parameters.toSet) +if (h.parameters.isEmpty) { + // If there is no table alias specified, turn the entire subtree into a BroadcastHint. + BroadcastHint(h.child) +} else { + // Otherwise, find within the subtree query plans that should be broadcasted. + applyBroadcastHint(h.child, h.parameters.toSet) +} } } http://git-wip-us.apache.org/repos/asf/spark/blob/527fc5d0/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 147e765..620c8bd 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -1161,6 +1161,22 @@ class Dataset[T] private[sql]( def apply(colName: String): Column = col(colName) /** + * Specifies some hint on the current Dataset. 
As an example, the following code specifies + * that one of the plan can be broadcasted: + * + * {{{ + * df1.join(df2.hint("broadcast")) + * }}} + * + * @group basic + * @since 2.2.0 + */ + @scala.annotation.varargs + def hint(name: String, parameters: String*): Dataset[T] = withTypedPlan { +Hint(name, parameters, logicalPlan) + } + + /** * Selects column based on the column name and return it as a [[Column]]. * * @note The column name can also reference to a nested column like `a.b`. http://git-wip-us.apache.org/repos/asf/spark/blob/527fc5d0/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala index 541ffb5..4a52af6 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala @@ -151,7 +151,7 @@ class DataFrameJoinSuite extends QueryTest with SharedSQLContext { Row(1, 1, 1, 1) :: Row(2, 1, 2, 2) :: Nil) } - test("broadcast join hint") { + test("broadcast
spark git commit: [SPARK-20441][SPARK-20432][SS] Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation
Repository: spark Updated Branches: refs/heads/branch-2.2 b5947f5c3 -> b1a732fea [SPARK-20441][SPARK-20432][SS] Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation ## What changes were proposed in this pull request? Within the same streaming query, when one `StreamingRelation` is referred multiple times â e.g. `df.union(df)` â we should transform it only to one `StreamingExecutionRelation`, instead of two or more different `StreamingExecutionRelation`s (each of which would have a separate set of source, source logs, ...). ## How was this patch tested? Added two test cases, each of which would fail without this patch. Author: Liwei LinCloses #17735 from lw-lin/SPARK-20441. (cherry picked from commit 27f543b15f2f493f6f8373e46b4c9564b0a1bf81) Signed-off-by: Burak Yavuz Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b1a732fe Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b1a732fe Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b1a732fe Branch: refs/heads/branch-2.2 Commit: b1a732fead32a37afcb7cf7a35facc49a449b8e2 Parents: b5947f5 Author: Liwei Lin Authored: Wed May 3 08:55:02 2017 -0700 Committer: Burak Yavuz Committed: Wed May 3 08:55:17 2017 -0700 -- .../execution/streaming/StreamExecution.scala | 20 .../spark/sql/streaming/StreamSuite.scala | 48 2 files changed, 60 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b1a732fe/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala index affc201..b6ddf74 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala @@ -23,6 +23,7 @@ import java.util.concurrent.{CountDownLatch, TimeUnit} import java.util.concurrent.atomic.AtomicReference import java.util.concurrent.locks.ReentrantLock +import scala.collection.mutable.{Map => MutableMap} import scala.collection.mutable.ArrayBuffer import scala.util.control.NonFatal @@ -148,15 +149,18 @@ class StreamExecution( "logicalPlan must be initialized in StreamExecutionThread " + s"but the current thread was ${Thread.currentThread}") var nextSourceId = 0L +val toExecutionRelationMap = MutableMap[StreamingRelation, StreamingExecutionRelation]() val _logicalPlan = analyzedPlan.transform { - case StreamingRelation(dataSource, _, output) => -// Materialize source to avoid creating it in every batch -val metadataPath = s"$checkpointRoot/sources/$nextSourceId" -val source = dataSource.createSource(metadataPath) -nextSourceId += 1 -// We still need to use the previous `output` instead of `source.schema` as attributes in -// "df.logicalPlan" has already used attributes of the previous `output`. 
-StreamingExecutionRelation(source, output) + case streamingRelation@StreamingRelation(dataSource, _, output) => +toExecutionRelationMap.getOrElseUpdate(streamingRelation, { + // Materialize source to avoid creating it in every batch + val metadataPath = s"$checkpointRoot/sources/$nextSourceId" + val source = dataSource.createSource(metadataPath) + nextSourceId += 1 + // We still need to use the previous `output` instead of `source.schema` as attributes in + // "df.logicalPlan" has already used attributes of the previous `output`. + StreamingExecutionRelation(source, output) +}) } sources = _logicalPlan.collect { case s: StreamingExecutionRelation => s.source } uniqueSources = sources.distinct http://git-wip-us.apache.org/repos/asf/spark/blob/b1a732fe/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala index 01ea62a..1fc0629 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala @@ -71,6 +71,27 @@ class StreamSuite extends StreamTest { CheckAnswer(Row(1, 1, "one"), Row(2, 2, "two"), Row(4, 4,
spark git commit: [SPARK-20441][SPARK-20432][SS] Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation
Repository: spark Updated Branches: refs/heads/master 7f96f2d7f -> 27f543b15 [SPARK-20441][SPARK-20432][SS] Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation ## What changes were proposed in this pull request? Within the same streaming query, when one `StreamingRelation` is referred multiple times â e.g. `df.union(df)` â we should transform it only to one `StreamingExecutionRelation`, instead of two or more different `StreamingExecutionRelation`s (each of which would have a separate set of source, source logs, ...). ## How was this patch tested? Added two test cases, each of which would fail without this patch. Author: Liwei LinCloses #17735 from lw-lin/SPARK-20441. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/27f543b1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/27f543b1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/27f543b1 Branch: refs/heads/master Commit: 27f543b15f2f493f6f8373e46b4c9564b0a1bf81 Parents: 7f96f2d Author: Liwei Lin Authored: Wed May 3 08:55:02 2017 -0700 Committer: Burak Yavuz Committed: Wed May 3 08:55:02 2017 -0700 -- .../execution/streaming/StreamExecution.scala | 20 .../spark/sql/streaming/StreamSuite.scala | 48 2 files changed, 60 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/27f543b1/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala index affc201..b6ddf74 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala @@ -23,6 +23,7 @@ import java.util.concurrent.{CountDownLatch, TimeUnit} import java.util.concurrent.atomic.AtomicReference import java.util.concurrent.locks.ReentrantLock +import scala.collection.mutable.{Map => MutableMap} import scala.collection.mutable.ArrayBuffer import scala.util.control.NonFatal @@ -148,15 +149,18 @@ class StreamExecution( "logicalPlan must be initialized in StreamExecutionThread " + s"but the current thread was ${Thread.currentThread}") var nextSourceId = 0L +val toExecutionRelationMap = MutableMap[StreamingRelation, StreamingExecutionRelation]() val _logicalPlan = analyzedPlan.transform { - case StreamingRelation(dataSource, _, output) => -// Materialize source to avoid creating it in every batch -val metadataPath = s"$checkpointRoot/sources/$nextSourceId" -val source = dataSource.createSource(metadataPath) -nextSourceId += 1 -// We still need to use the previous `output` instead of `source.schema` as attributes in -// "df.logicalPlan" has already used attributes of the previous `output`. -StreamingExecutionRelation(source, output) + case streamingRelation@StreamingRelation(dataSource, _, output) => +toExecutionRelationMap.getOrElseUpdate(streamingRelation, { + // Materialize source to avoid creating it in every batch + val metadataPath = s"$checkpointRoot/sources/$nextSourceId" + val source = dataSource.createSource(metadataPath) + nextSourceId += 1 + // We still need to use the previous `output` instead of `source.schema` as attributes in + // "df.logicalPlan" has already used attributes of the previous `output`. 
+ StreamingExecutionRelation(source, output) +}) } sources = _logicalPlan.collect { case s: StreamingExecutionRelation => s.source } uniqueSources = sources.distinct http://git-wip-us.apache.org/repos/asf/spark/blob/27f543b1/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala index 01ea62a..1fc0629 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala @@ -71,6 +71,27 @@ class StreamSuite extends StreamTest { CheckAnswer(Row(1, 1, "one"), Row(2, 2, "two"), Row(4, 4, "four"))) } + test("SPARK-20432: union one stream with itself") { +val df =
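The hunk above is the whole fix: the plan transform now memoizes on the original `StreamingRelation` node, so a query that references the same relation several times (for example the self-union test added to `StreamSuite`) materializes exactly one source, one metadata log, and one `StreamingExecutionRelation`. A minimal, self-contained Scala sketch of that memoization pattern, using hypothetical stand-in case classes rather than the real Spark plan nodes:

```scala
import scala.collection.mutable.{Map => MutableMap}

// Hypothetical stand-ins for StreamingRelation / StreamingExecutionRelation,
// only to illustrate the getOrElseUpdate pattern used in the patch.
case class Relation(name: String)
case class ExecutionRelation(sourceId: Long, relation: Relation)

object DedupSketch extends App {
  var nextSourceId = 0L
  val toExecutionRelation = MutableMap[Relation, ExecutionRelation]()

  // Materialize each logical relation at most once, even if the plan refers to it twice.
  def materialize(r: Relation): ExecutionRelation =
    toExecutionRelation.getOrElseUpdate(r, {
      val exec = ExecutionRelation(nextSourceId, r)
      nextSourceId += 1
      exec
    })

  val rel = Relation("memoryStream")
  val plan = Seq(rel, rel)                   // df.union(df): the same relation appears twice
  val materialized = plan.map(materialize)

  assert(materialized(0) eq materialized(1)) // one shared execution relation, not two
  assert(nextSourceId == 1L)                 // and therefore only one source was created
  println(materialized)
}
```

Without the memoization, the second reference would get its own source id and source instance, which is exactly the duplicated-source behaviour the two new test cases guard against.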
spark git commit: [SPARK-16957][MLLIB] Use midpoints for split values.
Repository: spark Updated Branches: refs/heads/master 16fab6b0e -> 7f96f2d7f [SPARK-16957][MLLIB] Use midpoints for split values. ## What changes were proposed in this pull request? Use midpoints for split values now, and maybe later to make it weighted. ## How was this patch tested? + [x] add unit test. + [x] revise Split's unit test. Author: Yan Facai (颜发才)Author: 颜发才（Yan Facai） Closes #17556 from facaiy/ENH/decision_tree_overflow_and_precision_in_aggregation. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7f96f2d7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7f96f2d7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7f96f2d7 Branch: refs/heads/master Commit: 7f96f2d7f2d5abf81dd7f8ca27fea35cf798fd65 Parents: 16fab6b Author: Yan Facai (颜发才) Authored: Wed May 3 10:54:40 2017 +0100 Committer: Sean Owen Committed: Wed May 3 10:54:40 2017 +0100 -- .../spark/ml/tree/impl/RandomForest.scala | 15 --- .../spark/ml/tree/impl/RandomForestSuite.scala | 41 +--- python/pyspark/mllib/tree.py| 12 +++--- 3 files changed, 51 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7f96f2d7/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala b/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala index 008dd19..82e1ed8 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala @@ -996,7 +996,7 @@ private[spark] object RandomForest extends Logging { require(metadata.isContinuous(featureIndex), "findSplitsForContinuousFeature can only be used to find splits for a continuous feature.") -val splits: Array[Double] = if (featureSamples.isEmpty) { +val splits: Array[Double] = if (featureSamples.isEmpty) { Array.empty[Double] } else { val numSplits = metadata.numSplits(featureIndex) @@ -1009,10 +1009,15 @@ private[spark] object RandomForest extends Logging { // sort distinct values val valueCounts = valueCountMap.toSeq.sortBy(_._1).toArray - // if possible splits is not enough or just enough, just return all possible splits val possibleSplits = valueCounts.length - 1 - if (possibleSplits <= numSplits) { -valueCounts.map(_._1).init + if (possibleSplits == 0) { +// constant feature +Array.empty[Double] + } else if (possibleSplits <= numSplits) { +// if possible splits is not enough or just enough, just return all possible splits +(1 to possibleSplits) + .map(index => (valueCounts(index - 1)._1 + valueCounts(index)._1) / 2.0) + .toArray } else { // stride between splits val stride: Double = numSamples.toDouble / (numSplits + 1) @@ -1037,7 +1042,7 @@ private[spark] object RandomForest extends Logging { // makes the gap between currentCount and targetCount smaller, // previous value is a split threshold. 
if (previousGap < currentGap) { -splitsBuilder += valueCounts(index - 1)._1 +splitsBuilder += (valueCounts(index - 1)._1 + valueCounts(index)._1) / 2.0 targetCount += stride } index += 1 http://git-wip-us.apache.org/repos/asf/spark/blob/7f96f2d7/mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala index e1ab7c2..df155b4 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala @@ -104,6 +104,31 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { assert(splits.distinct.length === splits.length) } +// SPARK-16957: Use midpoints for split values. +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + + // possibleSplits <= numSplits + { +val featureSamples = Array(0, 1, 0, 0, 1, 0, 1, 1).map(_.toDouble) +val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) +val expectedSplits = Array((0.0
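The change above is easier to read outside the diff: candidate thresholds for a continuous feature are now the midpoints between adjacent distinct sorted values (rather than the left values themselves), and a constant feature produces no split at all. Below is a self-contained Scala sketch of just that rule; it covers only the small-cardinality branch and deliberately omits the stride-based selection the real `findSplitsForContinuousFeature` performs when there are more distinct values than requested splits:

```scala
object MidpointSplits {
  /**
   * Candidate thresholds for a continuous feature when the number of distinct values
   * is small (at most numSplits + 1): midpoints of adjacent distinct sorted values.
   * The stride-over-value-counts branch of the real implementation is not shown here.
   */
  def findSplits(featureSamples: Seq[Double], numSplits: Int): Array[Double] = {
    val distinctSorted = featureSamples.distinct.sorted
    val possibleSplits = distinctSorted.length - 1
    if (possibleSplits <= 0) {
      Array.empty[Double]                      // empty or constant feature: no usable split
    } else {
      require(possibleSplits <= numSplits, "sketch only covers the small-cardinality branch")
      (1 to possibleSplits)
        .map(i => (distinctSorted(i - 1) + distinctSorted(i)) / 2.0)
        .toArray
    }
  }

  def main(args: Array[String]): Unit = {
    // Mirrors the suite's shape: samples drawn from {0, 1} yield a single threshold at the midpoint 0.5.
    println(findSplits(Seq(0, 1, 0, 0, 1, 0, 1, 1).map(_.toDouble), numSplits = 3).toSeq)
  }
}
```

With the suite's {0, 1} samples this yields the single threshold 0.5, which matches the midpoint-based expected splits the revised `RandomForestSuite` constructs.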
spark git commit: [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release
Repository: spark Updated Branches: refs/heads/branch-2.2 4f647ab66 -> b5947f5c3 [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release ## What changes were proposed in this pull request? Fix build warnings primarily related to Breeze 0.13 operator changes, Java style problems ## How was this patch tested? Existing tests Author: Sean OwenCloses #17803 from srowen/SPARK-20523. (cherry picked from commit 16fab6b0ef3dcb33f92df30e17680922ad5fb672) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b5947f5c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b5947f5c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b5947f5c Branch: refs/heads/branch-2.2 Commit: b5947f5c33eb403d65b1c316d1781c3d7cebf01b Parents: 4f647ab Author: Sean Owen Authored: Wed May 3 10:18:35 2017 +0100 Committer: Sean Owen Committed: Wed May 3 10:18:48 2017 +0100 -- .../apache/spark/network/yarn/YarnShuffleService.java | 4 ++-- .../main/java/org/apache/spark/unsafe/Platform.java | 3 ++- .../org/apache/spark/memory/TaskMemoryManager.java| 3 ++- .../apache/spark/scheduler/TaskSetManagerSuite.scala | 11 ++- .../spark/storage/BlockReplicationPolicySuite.scala | 1 + dev/checkstyle-suppressions.xml | 4 .../sql/streaming/JavaStructuredSessionization.java | 2 -- .../scala/org/apache/spark/graphx/lib/PageRank.scala | 14 +++--- .../scala/org/apache/spark/ml/ann/LossFunction.scala | 4 ++-- .../apache/spark/ml/clustering/GaussianMixture.scala | 2 +- .../spark/mllib/clustering/GaussianMixture.scala | 2 +- .../org/apache/spark/mllib/clustering/LDAModel.scala | 8 .../apache/spark/mllib/clustering/LDAOptimizer.scala | 12 ++-- .../org/apache/spark/mllib/clustering/LDAUtils.scala | 2 +- .../spark/ml/classification/NaiveBayesSuite.scala | 2 +- pom.xml | 4 .../scheduler/cluster/YarnSchedulerBackendSuite.scala | 2 ++ .../apache/spark/sql/streaming/GroupStateTimeout.java | 5 - .../catalyst/expressions/JsonExpressionsSuite.scala | 2 +- .../parquet/SpecificParquetRecordReaderBase.java | 5 +++-- .../spark/sql/execution/QueryExecutionSuite.scala | 2 ++ .../sql/streaming/StreamingQueryListenerSuite.scala | 1 + .../spark/sql/hive/execution/HiveDDLSuite.scala | 2 +- 23 files changed, 54 insertions(+), 43 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b5947f5c/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java -- diff --git a/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java b/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java index 4acc203..fd50e3a 100644 --- a/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java +++ b/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java @@ -363,9 +363,9 @@ public class YarnShuffleService extends AuxiliaryService { // If another DB was initialized first just make sure all the DBs are in the same // location. 
Path newLoc = new Path(_recoveryPath, dbName); - Path copyFrom = new Path(f.toURI()); + Path copyFrom = new Path(f.toURI()); if (!newLoc.equals(copyFrom)) { -logger.info("Moving " + copyFrom + " to: " + newLoc); +logger.info("Moving " + copyFrom + " to: " + newLoc); try { // The move here needs to handle moving non-empty directories across NFS mounts FileSystem fs = FileSystem.getLocal(_conf); http://git-wip-us.apache.org/repos/asf/spark/blob/b5947f5c/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java -- diff --git a/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java b/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java index 1321b83..4ab5b68 100644 --- a/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java +++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java @@ -48,7 +48,8 @@ public final class Platform { boolean _unaligned; String arch = System.getProperty("os.arch", ""); if (arch.equals("ppc64le") || arch.equals("ppc64")) { - // Since java.nio.Bits.unaligned() doesn't return true on ppc (See JDK-8165231), but ppc64 and ppc64le support it + // Since java.nio.Bits.unaligned() doesn't return true on ppc (See
spark git commit: [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release
Repository: spark Updated Branches: refs/heads/master db2fb84b4 -> 16fab6b0e [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release ## What changes were proposed in this pull request? Fix build warnings primarily related to Breeze 0.13 operator changes, Java style problems ## How was this patch tested? Existing tests Author: Sean OwenCloses #17803 from srowen/SPARK-20523. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/16fab6b0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/16fab6b0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/16fab6b0 Branch: refs/heads/master Commit: 16fab6b0ef3dcb33f92df30e17680922ad5fb672 Parents: db2fb84 Author: Sean Owen Authored: Wed May 3 10:18:35 2017 +0100 Committer: Sean Owen Committed: Wed May 3 10:18:35 2017 +0100 -- .../apache/spark/network/yarn/YarnShuffleService.java | 4 ++-- .../main/java/org/apache/spark/unsafe/Platform.java | 3 ++- .../org/apache/spark/memory/TaskMemoryManager.java| 3 ++- .../apache/spark/scheduler/TaskSetManagerSuite.scala | 11 ++- .../spark/storage/BlockReplicationPolicySuite.scala | 1 + dev/checkstyle-suppressions.xml | 4 .../sql/streaming/JavaStructuredSessionization.java | 2 -- .../scala/org/apache/spark/graphx/lib/PageRank.scala | 14 +++--- .../scala/org/apache/spark/ml/ann/LossFunction.scala | 4 ++-- .../apache/spark/ml/clustering/GaussianMixture.scala | 2 +- .../spark/mllib/clustering/GaussianMixture.scala | 2 +- .../org/apache/spark/mllib/clustering/LDAModel.scala | 8 .../apache/spark/mllib/clustering/LDAOptimizer.scala | 12 ++-- .../org/apache/spark/mllib/clustering/LDAUtils.scala | 2 +- .../spark/ml/classification/NaiveBayesSuite.scala | 2 +- pom.xml | 4 .../scheduler/cluster/YarnSchedulerBackendSuite.scala | 2 ++ .../apache/spark/sql/streaming/GroupStateTimeout.java | 5 - .../catalyst/expressions/JsonExpressionsSuite.scala | 2 +- .../parquet/SpecificParquetRecordReaderBase.java | 5 +++-- .../spark/sql/execution/QueryExecutionSuite.scala | 2 ++ .../sql/streaming/StreamingQueryListenerSuite.scala | 1 + .../spark/sql/hive/execution/HiveDDLSuite.scala | 2 +- 23 files changed, 54 insertions(+), 43 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/16fab6b0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java -- diff --git a/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java b/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java index 4acc203..fd50e3a 100644 --- a/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java +++ b/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java @@ -363,9 +363,9 @@ public class YarnShuffleService extends AuxiliaryService { // If another DB was initialized first just make sure all the DBs are in the same // location. 
Path newLoc = new Path(_recoveryPath, dbName); - Path copyFrom = new Path(f.toURI()); + Path copyFrom = new Path(f.toURI()); if (!newLoc.equals(copyFrom)) { -logger.info("Moving " + copyFrom + " to: " + newLoc); +logger.info("Moving " + copyFrom + " to: " + newLoc); try { // The move here needs to handle moving non-empty directories across NFS mounts FileSystem fs = FileSystem.getLocal(_conf); http://git-wip-us.apache.org/repos/asf/spark/blob/16fab6b0/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java -- diff --git a/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java b/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java index 1321b83..4ab5b68 100644 --- a/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java +++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java @@ -48,7 +48,8 @@ public final class Platform { boolean _unaligned; String arch = System.getProperty("os.arch", ""); if (arch.equals("ppc64le") || arch.equals("ppc64")) { - // Since java.nio.Bits.unaligned() doesn't return true on ppc (See JDK-8165231), but ppc64 and ppc64le support it + // Since java.nio.Bits.unaligned() doesn't return true on ppc (See JDK-8165231), but + // ppc64 and ppc64le support it _unaligned = true; } else { try {
spark git commit: [SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and PCA (v2)
Repository: spark Updated Branches: refs/heads/master 6235132a8 -> db2fb84b4 [SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and PCA (v2) Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only). Based on #7963, updated. ## How was this patch tested? New doc tests and unit tests. Ran all examples locally. Author: MechCoderAuthor: Nick Pentreath Closes #17621 from MLnick/SPARK-6227-pyspark-svd-pca. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/db2fb84b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/db2fb84b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/db2fb84b Branch: refs/heads/master Commit: db2fb84b4a3c45daa449cc9232340193ce8eb37d Parents: 6235132 Author: MechCoder Authored: Wed May 3 10:58:05 2017 +0200 Committer: Nick Pentreath Committed: Wed May 3 10:58:05 2017 +0200 -- docs/mllib-dimensionality-reduction.md | 29 +-- .../spark/examples/mllib/JavaPCAExample.java| 27 ++- .../spark/examples/mllib/JavaSVDExample.java| 27 +-- .../main/python/mllib/pca_rowmatrix_example.py | 46 + examples/src/main/python/mllib/svd_example.py | 48 + .../examples/mllib/PCAOnRowMatrixExample.scala | 4 +- .../spark/examples/mllib/SVDExample.scala | 11 +- python/pyspark/mllib/linalg/distributed.py | 199 ++- python/pyspark/mllib/tests.py | 63 ++ 9 files changed, 408 insertions(+), 46 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/db2fb84b/docs/mllib-dimensionality-reduction.md -- diff --git a/docs/mllib-dimensionality-reduction.md b/docs/mllib-dimensionality-reduction.md index 539cbc1..a72680d 100644 --- a/docs/mllib-dimensionality-reduction.md +++ b/docs/mllib-dimensionality-reduction.md @@ -76,13 +76,14 @@ Refer to the [`SingularValueDecomposition` Java docs](api/java/org/apache/spark/ The same code applies to `IndexedRowMatrix` if `U` is defined as an `IndexedRowMatrix`. + + +Refer to the [`SingularValueDecomposition` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.SingularValueDecomposition) for details on the API. -In order to run the above application, follow the instructions -provided in the [Self-Contained -Applications](quick-start.html#self-contained-applications) section of the Spark -quick-start guide. Be sure to also include *spark-mllib* to your build file as -a dependency. +{% include_example python/mllib/svd_example.py %} +The same code applies to `IndexedRowMatrix` if `U` is defined as an +`IndexedRowMatrix`. @@ -118,17 +119,21 @@ Refer to the [`PCA` Scala docs](api/scala/index.html#org.apache.spark.mllib.feat The following code demonstrates how to compute principal components on a `RowMatrix` and use them to project the vectors into a low-dimensional space. -The number of columns should be small, e.g, less than 1000. Refer to the [`RowMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API. {% include_example java/org/apache/spark/examples/mllib/JavaPCAExample.java %} - -In order to run the above application, follow the instructions -provided in the [Self-Contained Applications](quick-start.html#self-contained-applications) -section of the Spark -quick-start guide. Be sure to also include *spark-mllib* to your build file as -a dependency. + + +The following code demonstrates how to compute principal components on a `RowMatrix` +and use them to project the vectors into a low-dimensional space. 
+ +Refer to the [`RowMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix) for details on the API. + +{% include_example python/mllib/pca_rowmatrix_example.py %} + + + http://git-wip-us.apache.org/repos/asf/spark/blob/db2fb84b/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java b/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java index 3077f55..0a7dc62 100644 --- a/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java +++ b/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java @@ -18,7 +18,8 @@ package org.apache.spark.examples.mllib; // $example on$ -import java.util.LinkedList; +import java.util.Arrays; +import java.util.List; // $example off$ import org.apache.spark.SparkConf; @@ -39,21 +40,25 @@ public class JavaPCAExample {
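The doc hunk above replaces the old "self-contained applications" boilerplate with Python tabs that include the two new examples. The operations being wrapped already exist on the Scala `RowMatrix`; for orientation, here is a compact Scala sketch of the same SVD and PCA calls the new PySpark wrappers expose (the app name, master setting, and toy matrix are arbitrary choices for this sketch):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object RowMatrixSvdPcaSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SvdPcaSketch").setMaster("local[2]"))

    // A tiny 4 x 3 matrix, one dense vector per row.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(2.0, 1.0, 0.0),
      Vectors.dense(4.0, 0.0, 1.0),
      Vectors.dense(0.0, 3.0, 2.0)))
    val mat = new RowMatrix(rows)

    // SVD: keep the top 2 singular values and also materialize U.
    val svd = mat.computeSVD(2, computeU = true)
    println(s"singular values: ${svd.s}")

    // PCA: project the rows onto the top 2 principal components.
    val pc = mat.computePrincipalComponents(2)
    val projected = mat.multiply(pc)
    projected.rows.collect().foreach(println)

    sc.stop()
  }
}
```

The new `python/mllib/svd_example.py` and `pca_rowmatrix_example.py` follow the same shape through the Python `RowMatrix` wrapper, with `computeSVD` returning a `SingularValueDecomposition` whose `U`, `s`, and `V` mirror the Scala types.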
spark git commit: [SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and PCA (v2)
Repository: spark Updated Branches: refs/heads/branch-2.2 c80242ab9 -> 4f647ab66 [SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and PCA (v2) Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only). Based on #7963, updated. ## How was this patch tested? New doc tests and unit tests. Ran all examples locally. Author: MechCoderAuthor: Nick Pentreath Closes #17621 from MLnick/SPARK-6227-pyspark-svd-pca. (cherry picked from commit db2fb84b4a3c45daa449cc9232340193ce8eb37d) Signed-off-by: Nick Pentreath Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4f647ab6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4f647ab6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4f647ab6 Branch: refs/heads/branch-2.2 Commit: 4f647ab66353b136e4fdf02587ebbd88ce5c5b5f Parents: c80242a Author: MechCoder Authored: Wed May 3 10:58:05 2017 +0200 Committer: Nick Pentreath Committed: Wed May 3 10:58:20 2017 +0200 -- docs/mllib-dimensionality-reduction.md | 29 +-- .../spark/examples/mllib/JavaPCAExample.java| 27 ++- .../spark/examples/mllib/JavaSVDExample.java| 27 +-- .../main/python/mllib/pca_rowmatrix_example.py | 46 + examples/src/main/python/mllib/svd_example.py | 48 + .../examples/mllib/PCAOnRowMatrixExample.scala | 4 +- .../spark/examples/mllib/SVDExample.scala | 11 +- python/pyspark/mllib/linalg/distributed.py | 199 ++- python/pyspark/mllib/tests.py | 63 ++ 9 files changed, 408 insertions(+), 46 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4f647ab6/docs/mllib-dimensionality-reduction.md -- diff --git a/docs/mllib-dimensionality-reduction.md b/docs/mllib-dimensionality-reduction.md index 539cbc1..a72680d 100644 --- a/docs/mllib-dimensionality-reduction.md +++ b/docs/mllib-dimensionality-reduction.md @@ -76,13 +76,14 @@ Refer to the [`SingularValueDecomposition` Java docs](api/java/org/apache/spark/ The same code applies to `IndexedRowMatrix` if `U` is defined as an `IndexedRowMatrix`. + + +Refer to the [`SingularValueDecomposition` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.SingularValueDecomposition) for details on the API. -In order to run the above application, follow the instructions -provided in the [Self-Contained -Applications](quick-start.html#self-contained-applications) section of the Spark -quick-start guide. Be sure to also include *spark-mllib* to your build file as -a dependency. +{% include_example python/mllib/svd_example.py %} +The same code applies to `IndexedRowMatrix` if `U` is defined as an +`IndexedRowMatrix`. @@ -118,17 +119,21 @@ Refer to the [`PCA` Scala docs](api/scala/index.html#org.apache.spark.mllib.feat The following code demonstrates how to compute principal components on a `RowMatrix` and use them to project the vectors into a low-dimensional space. -The number of columns should be small, e.g, less than 1000. Refer to the [`RowMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API. {% include_example java/org/apache/spark/examples/mllib/JavaPCAExample.java %} - -In order to run the above application, follow the instructions -provided in the [Self-Contained Applications](quick-start.html#self-contained-applications) -section of the Spark -quick-start guide. Be sure to also include *spark-mllib* to your build file as -a dependency. 
+ + +The following code demonstrates how to compute principal components on a `RowMatrix` +and use them to project the vectors into a low-dimensional space. + +Refer to the [`RowMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix) for details on the API. + +{% include_example python/mllib/pca_rowmatrix_example.py %} + + + http://git-wip-us.apache.org/repos/asf/spark/blob/4f647ab6/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java b/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java index 3077f55..0a7dc62 100644 --- a/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java +++ b/examples/src/main/java/org/apache/spark/examples/mllib/JavaPCAExample.java @@ -18,7 +18,8 @@ package org.apache.spark.examples.mllib; // $example on$ -import java.util.LinkedList; +import java.util.Arrays; +import