spark git commit: [SPARK-20040][ML][PYTHON] pyspark wrapper for ChiSquareTest
Repository: spark
Updated Branches: refs/heads/master 7d432af8f -> a5c87707e

[SPARK-20040][ML][PYTHON] pyspark wrapper for ChiSquareTest

## What changes were proposed in this pull request?

A pyspark wrapper for spark.ml.stat.ChiSquareTest.

## How was this patch tested?

unit tests
doctests

Author: Bago Amirbekian

Closes #17421 from MrBago/chiSquareTestWrapper.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a5c87707
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a5c87707
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a5c87707

Branch: refs/heads/master
Commit: a5c87707eaec5cacdfb703eb396dfc264bc54cda
Parents: 7d432af
Author: Bago Amirbekian
Authored: Tue Mar 28 19:19:16 2017 -0700
Committer: Joseph K. Bradley
Committed: Tue Mar 28 19:19:16 2017 -0700

--
 dev/sparktestsupport/modules.py               |  1 +
 .../apache/spark/ml/stat/ChiSquareTest.scala  |  6 +-
 python/docs/pyspark.ml.rst                    |  8 ++
 python/pyspark/ml/stat.py                     | 93
 python/pyspark/ml/tests.py                    | 31 +--
 5 files changed, 127 insertions(+), 12 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/a5c87707/dev/sparktestsupport/modules.py
--
diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index eaf1f3a..246f518 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -431,6 +431,7 @@ pyspark_ml = Module(
         "pyspark.ml.linalg.__init__",
         "pyspark.ml.recommendation",
         "pyspark.ml.regression",
+        "pyspark.ml.stat",
         "pyspark.ml.tuning",
         "pyspark.ml.tests",
     ],

http://git-wip-us.apache.org/repos/asf/spark/blob/a5c87707/mllib/src/main/scala/org/apache/spark/ml/stat/ChiSquareTest.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/stat/ChiSquareTest.scala b/mllib/src/main/scala/org/apache/spark/ml/stat/ChiSquareTest.scala
index 21eba9a..5b38ca7 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/stat/ChiSquareTest.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/stat/ChiSquareTest.scala
@@ -46,9 +46,9 @@ object ChiSquareTest {
     statistics: Vector)

   /**
-   * Conduct Pearson's independence test for every feature against the label across the input RDD.
-   * For each feature, the (feature, label) pairs are converted into a contingency matrix for which
-   * the Chi-squared statistic is computed. All label and feature values must be categorical.
+   * Conduct Pearson's independence test for every feature against the label. For each feature, the
+   * (feature, label) pairs are converted into a contingency matrix for which the Chi-squared
+   * statistic is computed. All label and feature values must be categorical.
    *
    * The null hypothesis is that the occurrence of the outcomes is statistically independent.
    *

http://git-wip-us.apache.org/repos/asf/spark/blob/a5c87707/python/docs/pyspark.ml.rst
--
diff --git a/python/docs/pyspark.ml.rst b/python/docs/pyspark.ml.rst
index a681834..930646d 100644
--- a/python/docs/pyspark.ml.rst
+++ b/python/docs/pyspark.ml.rst
@@ -65,6 +65,14 @@ pyspark.ml.regression module
     :undoc-members:
     :inherited-members:

+pyspark.ml.stat module
+----------------------
+
+.. automodule:: pyspark.ml.stat
+    :members:
+    :undoc-members:
+    :inherited-members:
+
 pyspark.ml.tuning module

http://git-wip-us.apache.org/repos/asf/spark/blob/a5c87707/python/pyspark/ml/stat.py
--
diff --git a/python/pyspark/ml/stat.py b/python/pyspark/ml/stat.py
new file mode 100644
index 000..db043ff
--- /dev/null
+++ b/python/pyspark/ml/stat.py
@@ -0,0 +1,93 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
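For readers who want to try the new module: the Python wrapper mirrors the Scala `ChiSquareTest.test(dataset, featuresCol, labelCol)` API touched above. A minimal Scala sketch of the equivalent call (the toy data and a SparkSession named `spark` are assumptions for illustration, not part of the patch):

    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.ml.stat.ChiSquareTest
    import spark.implicits._

    // Toy categorical data: (label, features) pairs.
    val df = Seq(
      (0.0, Vectors.dense(0.5, 10.0)),
      (0.0, Vectors.dense(1.5, 20.0)),
      (1.0, Vectors.dense(1.5, 30.0)),
      (1.0, Vectors.dense(3.5, 40.0))
    ).toDF("label", "features")

    // Returns a one-row DataFrame with pValues, degreesOfFreedom and statistics.
    val chi = ChiSquareTest.test(df, "features", "label").head
    println(s"pValues = ${chi.getAs[Vector](0)}")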
spark git commit: [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type Gini
Repository: spark
Updated Branches: refs/heads/branch-2.1 4964dbedb -> 30954806f

[SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type Gini

Fix bug: DecisionTreeModel can't recognize impurity "Gini" when loading.

TODO:
+ [x] add unit test
+ [x] fix the bug

Author: 颜发才 (Yan Facai)

Closes #17407 from facaiy/BUG/decision_tree_loader_failer_with_Gini_impurity.

(cherry picked from commit 7d432af8f3c47973550ea253dae0c23cd2961bde)
Signed-off-by: Joseph K. Bradley

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/30954806
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/30954806
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/30954806

Branch: refs/heads/branch-2.1
Commit: 30954806f1be0dba63f0a608d824d7d811485801
Parents: 4964dbe
Author: 颜发才 (Yan Facai)
Authored: Tue Mar 28 16:14:01 2017 -0700
Committer: Joseph K. Bradley
Committed: Tue Mar 28 16:14:11 2017 -0700

--
 .../apache/spark/mllib/tree/impurity/Impurity.scala   |  2 +-
 .../classification/DecisionTreeClassifierSuite.scala  | 14 ++
 2 files changed, 15 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/30954806/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala
index a5bdc2c..98a3021 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala
@@ -184,7 +184,7 @@ private[spark] object ImpurityCalculator {
    * the given stats.
    */
   def getCalculator(impurity: String, stats: Array[Double]): ImpurityCalculator = {
-    impurity match {
+    impurity.toLowerCase match {
       case "gini" => new GiniCalculator(stats)
       case "entropy" => new EntropyCalculator(stats)
       case "variance" => new VarianceCalculator(stats)

http://git-wip-us.apache.org/repos/asf/spark/blob/30954806/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
index c711e7f..692a172 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
@@ -383,6 +383,20 @@ class DecisionTreeClassifierSuite
     testEstimatorAndModelReadWrite(dt, continuousData, allParamSettings ++ Map("maxDepth" -> 0),
       checkModelData)
   }
+
+  test("SPARK-20043: " +
+    "ImpurityCalculator builder fails for uppercase impurity type Gini in model read/write") {
+    val rdd = TreeTests.getTreeReadWriteData(sc)
+    val data: DataFrame =
+      TreeTests.setMetadata(rdd, Map.empty[Int, Int], numClasses = 2)
+
+    val dt = new DecisionTreeClassifier()
+      .setImpurity("Gini")
+      .setMaxDepth(2)
+    val model = dt.fit(data)
+
+    testDefaultReadWrite(model)
+  }
 }

 private[ml] object DecisionTreeClassifierSuite extends SparkFunSuite {
spark git commit: [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type Gini
Repository: spark
Updated Branches: refs/heads/master 92e385e0b -> 7d432af8f

[SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type Gini

Fix bug: DecisionTreeModel can't recognize impurity "Gini" when loading.

TODO:
+ [x] add unit test
+ [x] fix the bug

Author: 颜发才 (Yan Facai)

Closes #17407 from facaiy/BUG/decision_tree_loader_failer_with_Gini_impurity.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7d432af8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7d432af8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7d432af8

Branch: refs/heads/master
Commit: 7d432af8f3c47973550ea253dae0c23cd2961bde
Parents: 92e385e
Author: 颜发才 (Yan Facai)
Authored: Tue Mar 28 16:14:01 2017 -0700
Committer: Joseph K. Bradley
Committed: Tue Mar 28 16:14:01 2017 -0700

--
 .../apache/spark/mllib/tree/impurity/Impurity.scala   |  2 +-
 .../classification/DecisionTreeClassifierSuite.scala  | 14 ++
 2 files changed, 15 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/7d432af8/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala
index a5bdc2c..98a3021 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala
@@ -184,7 +184,7 @@ private[spark] object ImpurityCalculator {
    * the given stats.
    */
   def getCalculator(impurity: String, stats: Array[Double]): ImpurityCalculator = {
-    impurity match {
+    impurity.toLowerCase match {
       case "gini" => new GiniCalculator(stats)
       case "entropy" => new EntropyCalculator(stats)
       case "variance" => new VarianceCalculator(stats)

http://git-wip-us.apache.org/repos/asf/spark/blob/7d432af8/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
index 10de503..964fcfb 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
@@ -385,6 +385,20 @@ class DecisionTreeClassifierSuite
     testEstimatorAndModelReadWrite(dt, continuousData, allParamSettings ++ Map("maxDepth" -> 0),
       allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("SPARK-20043: " +
+    "ImpurityCalculator builder fails for uppercase impurity type Gini in model read/write") {
+    val rdd = TreeTests.getTreeReadWriteData(sc)
+    val data: DataFrame =
+      TreeTests.setMetadata(rdd, Map.empty[Int, Int], numClasses = 2)
+
+    val dt = new DecisionTreeClassifier()
+      .setImpurity("Gini")
+      .setMaxDepth(2)
+    val model = dt.fit(data)
+
+    testDefaultReadWrite(model)
+  }
 }

 private[ml] object DecisionTreeClassifierSuite extends SparkFunSuite {
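The one-line fix above is simply case normalization before pattern matching. A self-contained sketch of the idea (simplified: the real builders take a stats array and return calculator objects):

    def impurityName(impurity: String): String = impurity.toLowerCase match {
      case "gini"     => "GiniCalculator"
      case "entropy"  => "EntropyCalculator"
      case "variance" => "VarianceCalculator"
      case other      => throw new IllegalArgumentException(s"Unsupported impurity: $other")
    }

    impurityName("Gini")  // previously failed on model load; now resolves to "GiniCalculator"

Persisted models store the impurity string as the user typed it, so the loader has to accept "Gini" as well as "gini".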
[1/2] spark git commit: Preparing Spark release v2.1.1-rc2
Repository: spark
Updated Branches: refs/heads/branch-2.1 e669dd7ea -> 4964dbedb

Preparing Spark release v2.1.1-rc2

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/02b165dc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/02b165dc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/02b165dc

Branch: refs/heads/branch-2.1
Commit: 02b165dcc2ee5245d1293a375a31660c9d4e1fa6
Parents: e669dd7
Author: Patrick Wendell
Authored: Tue Mar 28 14:29:03 2017 -0700
Committer: Patrick Wendell
Committed: Tue Mar 28 14:29:03 2017 -0700

--
 R/pkg/DESCRIPTION                          | 2 +-
 assembly/pom.xml                           | 2 +-
 common/network-common/pom.xml              | 2 +-
 common/network-shuffle/pom.xml             | 2 +-
 common/network-yarn/pom.xml                | 2 +-
 common/sketch/pom.xml                      | 2 +-
 common/tags/pom.xml                        | 2 +-
 common/unsafe/pom.xml                      | 2 +-
 core/pom.xml                               | 2 +-
 docs/_config.yml                           | 4 ++--
 examples/pom.xml                           | 2 +-
 external/docker-integration-tests/pom.xml  | 2 +-
 external/flume-assembly/pom.xml            | 2 +-
 external/flume-sink/pom.xml                | 2 +-
 external/flume/pom.xml                     | 2 +-
 external/java8-tests/pom.xml               | 2 +-
 external/kafka-0-10-assembly/pom.xml       | 2 +-
 external/kafka-0-10-sql/pom.xml            | 2 +-
 external/kafka-0-10/pom.xml                | 2 +-
 external/kafka-0-8-assembly/pom.xml        | 2 +-
 external/kafka-0-8/pom.xml                 | 2 +-
 external/kinesis-asl-assembly/pom.xml      | 2 +-
 external/kinesis-asl/pom.xml               | 2 +-
 external/spark-ganglia-lgpl/pom.xml        | 2 +-
 graphx/pom.xml                             | 2 +-
 launcher/pom.xml                           | 2 +-
 mesos/pom.xml                              | 2 +-
 mllib-local/pom.xml                        | 2 +-
 mllib/pom.xml                              | 2 +-
 pom.xml                                    | 2 +-
 python/pyspark/version.py                  | 2 +-
 repl/pom.xml                               | 2 +-
 sql/catalyst/pom.xml                       | 2 +-
 sql/core/pom.xml                           | 2 +-
 sql/hive-thriftserver/pom.xml              | 2 +-
 sql/hive/pom.xml                           | 2 +-
 streaming/pom.xml                          | 2 +-
 tools/pom.xml                              | 2 +-
 yarn/pom.xml                               | 2 +-
 39 files changed, 40 insertions(+), 40 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/02b165dc/R/pkg/DESCRIPTION
--
diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index 2d461ca..1ceda7b 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: SparkR
 Type: Package
-Version: 2.1.2
+Version: 2.1.1
 Title: R Frontend for Apache Spark
 Description: The SparkR package provides an R Frontend for Apache Spark.
 Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),

http://git-wip-us.apache.org/repos/asf/spark/blob/02b165dc/assembly/pom.xml
--
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 6e092ef..cc290c0 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   org.apache.spark
   spark-parent_2.11
-  2.1.2-SNAPSHOT
+  2.1.1
   ../pom.xml

http://git-wip-us.apache.org/repos/asf/spark/blob/02b165dc/common/network-common/pom.xml
--
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index 77a4b64..ccf4b27 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   org.apache.spark
   spark-parent_2.11
-  2.1.2-SNAPSHOT
+  2.1.1
   ../../pom.xml

http://git-wip-us.apache.org/repos/asf/spark/blob/02b165dc/common/network-shuffle/pom.xml
--
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index 1a2d85a..98a2324 100644
--- a/common/network-shuffle/pom.xml
+++ b/common/network-shuffle/pom.xml
@@ -22,7 +22,7 @@
   org.apache.spark
   spark-parent_2.11
-  2.1.2-SNAPSHOT
+  2.1.1
   ../../pom.xml

http://git-wip-us.apache.org/repos/asf/spark/blob/02b165dc/common/network-yarn/pom.xml
--
diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml
index 7a57e89..dc1ad14 100644
--- a/common/network-yarn/pom.xml
+++ b/common/network-yarn/pom.xml
@@
[2/2] spark git commit: Preparing development version 2.1.2-SNAPSHOT
Preparing development version 2.1.2-SNAPSHOT

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4964dbed
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4964dbed
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4964dbed

Branch: refs/heads/branch-2.1
Commit: 4964dbedbdc2a127b2d5afb6ec11043a62a6e6f6
Parents: 02b165d
Author: Patrick Wendell
Authored: Tue Mar 28 14:29:08 2017 -0700
Committer: Patrick Wendell
Committed: Tue Mar 28 14:29:08 2017 -0700

--
 R/pkg/DESCRIPTION                          | 2 +-
 assembly/pom.xml                           | 2 +-
 common/network-common/pom.xml              | 2 +-
 common/network-shuffle/pom.xml             | 2 +-
 common/network-yarn/pom.xml                | 2 +-
 common/sketch/pom.xml                      | 2 +-
 common/tags/pom.xml                        | 2 +-
 common/unsafe/pom.xml                      | 2 +-
 core/pom.xml                               | 2 +-
 docs/_config.yml                           | 4 ++--
 examples/pom.xml                           | 2 +-
 external/docker-integration-tests/pom.xml  | 2 +-
 external/flume-assembly/pom.xml            | 2 +-
 external/flume-sink/pom.xml                | 2 +-
 external/flume/pom.xml                     | 2 +-
 external/java8-tests/pom.xml               | 2 +-
 external/kafka-0-10-assembly/pom.xml       | 2 +-
 external/kafka-0-10-sql/pom.xml            | 2 +-
 external/kafka-0-10/pom.xml                | 2 +-
 external/kafka-0-8-assembly/pom.xml        | 2 +-
 external/kafka-0-8/pom.xml                 | 2 +-
 external/kinesis-asl-assembly/pom.xml      | 2 +-
 external/kinesis-asl/pom.xml               | 2 +-
 external/spark-ganglia-lgpl/pom.xml        | 2 +-
 graphx/pom.xml                             | 2 +-
 launcher/pom.xml                           | 2 +-
 mesos/pom.xml                              | 2 +-
 mllib-local/pom.xml                        | 2 +-
 mllib/pom.xml                              | 2 +-
 pom.xml                                    | 2 +-
 python/pyspark/version.py                  | 2 +-
 repl/pom.xml                               | 2 +-
 sql/catalyst/pom.xml                       | 2 +-
 sql/core/pom.xml                           | 2 +-
 sql/hive-thriftserver/pom.xml              | 2 +-
 sql/hive/pom.xml                           | 2 +-
 streaming/pom.xml                          | 2 +-
 tools/pom.xml                              | 2 +-
 yarn/pom.xml                               | 2 +-
 39 files changed, 40 insertions(+), 40 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/4964dbed/R/pkg/DESCRIPTION
--
diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index 1ceda7b..2d461ca 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: SparkR
 Type: Package
-Version: 2.1.1
+Version: 2.1.2
 Title: R Frontend for Apache Spark
 Description: The SparkR package provides an R Frontend for Apache Spark.
 Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),

http://git-wip-us.apache.org/repos/asf/spark/blob/4964dbed/assembly/pom.xml
--
diff --git a/assembly/pom.xml b/assembly/pom.xml
index cc290c0..6e092ef 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   org.apache.spark
   spark-parent_2.11
-  2.1.1
+  2.1.2-SNAPSHOT
   ../pom.xml

http://git-wip-us.apache.org/repos/asf/spark/blob/4964dbed/common/network-common/pom.xml
--
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index ccf4b27..77a4b64 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   org.apache.spark
   spark-parent_2.11
-  2.1.1
+  2.1.2-SNAPSHOT
   ../../pom.xml

http://git-wip-us.apache.org/repos/asf/spark/blob/4964dbed/common/network-shuffle/pom.xml
--
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index 98a2324..1a2d85a 100644
--- a/common/network-shuffle/pom.xml
+++ b/common/network-shuffle/pom.xml
@@ -22,7 +22,7 @@
   org.apache.spark
   spark-parent_2.11
-  2.1.1
+  2.1.2-SNAPSHOT
   ../../pom.xml

http://git-wip-us.apache.org/repos/asf/spark/blob/4964dbed/common/network-yarn/pom.xml
--
diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml
index dc1ad14..7a57e89 100644
--- a/common/network-yarn/pom.xml
+++ b/common/network-yarn/pom.xml
@@ -22,7 +22,7 @@
   org.apache.spark
   spark-parent_2.11
-  2.1.1
+
[spark] Git Push Summary
Repository: spark
Updated Tags: refs/tags/v2.1.1-rc2 [created] 02b165dcc
spark git commit: [SPARK-14536][SQL][BACKPORT-2.1] fix to handle null value in array type column for postgres.
Repository: spark
Updated Branches: refs/heads/branch-2.1 fd2e40614 -> e669dd7ea

[SPARK-14536][SQL][BACKPORT-2.1] fix to handle null value in array type column for postgres.

## What changes were proposed in this pull request?

JDBC read fails with an NPE due to a missing null check for the array data type when the source table has null values in an array-typed column. For null values, ResultSet.getArray() returns null. This PR adds a null-safe check on the ResultSet.getArray() value before invoking methods on the Array object.

## How was this patch tested?

Updated the PostgresIntegration test suite to test null values. Ran docker integration tests on my laptop.

Author: sureshthalamati

Closes #17460 from sureshthalamati/jdbc_array_null_fix_spark_2.1-SPARK-14536.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e669dd7e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e669dd7e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e669dd7e

Branch: refs/heads/branch-2.1
Commit: e669dd7ea474f65fea0d5df011a333bda9de91b4
Parents: fd2e406
Author: sureshthalamati
Authored: Tue Mar 28 14:02:01 2017 -0700
Committer: Xiao Li
Committed: Tue Mar 28 14:02:01 2017 -0700

--
 .../spark/sql/jdbc/PostgresIntegrationSuite.scala    | 12 ++--
 .../sql/execution/datasources/jdbc/JdbcUtils.scala   |  6 +++---
 2 files changed, 13 insertions(+), 5 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/e669dd7e/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala
--
diff --git a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala
index c9325de..a1a065a 100644
--- a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala
+++ b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala
@@ -51,12 +51,17 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationSuite {
       + "B'1000100101', E'xDEADBEEF', true, '172.16.0.42', '192.168.0.0/16', "
       + """'{1, 2}', '{"a", null, "b"}', '{0.11, 0.22}', '{0.11, 0.22}', 'd1', 1.01, 1)"""
     ).executeUpdate()
+    conn.prepareStatement("INSERT INTO bar VALUES (null, null, null, null, null, "
+      + "null, null, null, null, null, "
+      + "null, null, null, null, null, null, null)"
+    ).executeUpdate()
   }

   test("Type mapping for various types") {
     val df = sqlContext.read.jdbc(jdbcUrl, "bar", new Properties)
-    val rows = df.collect()
-    assert(rows.length == 1)
+    val rows = df.collect().sortBy(_.toString())
+    assert(rows.length == 2)
+    // Test the types, and values using the first row.
     val types = rows(0).toSeq.map(x => x.getClass)
     assert(types.length == 17)
     assert(classOf[String].isAssignableFrom(types(0)))
@@ -96,6 +101,9 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationSuite {
     assert(rows(0).getString(14) == "d1")
     assert(rows(0).getFloat(15) == 1.01f)
     assert(rows(0).getShort(16) == 1)
+
+    // Test reading null values using the second row.
+    assert(0.until(16).forall(rows(1).isNullAt(_)))
   }

   test("Basic write test") {

http://git-wip-us.apache.org/repos/asf/spark/blob/e669dd7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
index 41edb65..81fdf69 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
@@ -423,9 +423,9 @@ object JdbcUtils extends Logging {
       }

       (rs: ResultSet, row: InternalRow, pos: Int) =>
-        val array = nullSafeConvert[Object](
-          rs.getArray(pos + 1).getArray,
-          array => new GenericArrayData(elementConversion.apply(array)))
+        val array = nullSafeConvert[java.sql.Array](
+          input = rs.getArray(pos + 1),
+          array => new GenericArrayData(elementConversion.apply(array.getArray)))
         row.update(pos, array)

     case _ => throw new IllegalArgumentException(s"Unsupported type ${dt.simpleString}")
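The underlying pattern: JDBC's `ResultSet.getArray` returns null for SQL NULL, so the handle must be null-checked before `.getArray` is invoked on it. A minimal sketch of that shape (the helper mirrors `nullSafeConvert` in JdbcUtils; the element conversion here is a stand-in):

    import java.sql.ResultSet

    // Apply `f` only when the column value is not SQL NULL.
    def nullSafeConvert[T](input: T, f: T => Any): Any =
      if (input == null) null else f(input)

    def readArrayColumn(rs: ResultSet, pos: Int): Any =
      nullSafeConvert[java.sql.Array](rs.getArray(pos + 1), a => a.getArray)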
spark git commit: [SPARK-20125][SQL] Dataset of type option of map does not work
Repository: spark
Updated Branches: refs/heads/branch-2.1 4bcb7d676 -> fd2e40614

[SPARK-20125][SQL] Dataset of type option of map does not work

When we build the deserializer expression for map type, we will use `StaticInvoke` to call `ArrayBasedMapData.toScalaMap`, and declare the return type as `scala.collection.immutable.Map`. If the map is inside an Option, we will wrap this `StaticInvoke` with `WrapOption`, which requires the input to be `scala.collection.Map`. Ideally this should be fine, as `scala.collection.immutable.Map` extends `scala.collection.Map`, but our `ObjectType` is too strict about this; this PR fixes it.

new regression test

Author: Wenchen Fan

Closes #17454 from cloud-fan/map.

(cherry picked from commit d4fac410e0554b7ccd44be44b7ce2fe07ed7f206)
Signed-off-by: Cheng Lian

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fd2e4061
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fd2e4061
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fd2e4061

Branch: refs/heads/branch-2.1
Commit: fd2e40614b511fb9ef3e52cc1351659fdbfd612a
Parents: 4bcb7d6
Author: Wenchen Fan
Authored: Tue Mar 28 11:47:43 2017 -0700
Committer: Cheng Lian
Committed: Tue Mar 28 12:36:27 2017 -0700

--
 .../src/main/scala/org/apache/spark/sql/types/ObjectType.scala  | 5 +
 .../src/test/scala/org/apache/spark/sql/DatasetSuite.scala      | 6 ++
 2 files changed, 11 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/fd2e4061/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala
index b18fba2..2d49fe0 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala
@@ -44,4 +44,9 @@ case class ObjectType(cls: Class[_]) extends DataType {
   def asNullable: DataType = this

   override def simpleString: String = cls.getName
+
+  override def acceptsType(other: DataType): Boolean = other match {
+    case ObjectType(otherCls) => cls.isAssignableFrom(otherCls)
+    case _ => false
+  }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/fd2e4061/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
index 381652d..9cc49b6 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
@@ -1072,10 +1072,16 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
     val ds2 = Seq(WithMap("hi", Map(42L -> "foo"))).toDS
     checkDataset(ds2.map(t => t), WithMap("hi", Map(42L -> "foo")))
   }
+
+  test("SPARK-20125: option of map") {
+    val ds = Seq(WithMapInOption(Some(Map(1 -> 1)))).toDS()
+    checkDataset(ds, WithMapInOption(Some(Map(1 -> 1))))
+  }
 }

 case class WithImmutableMap(id: String, map_test: scala.collection.immutable.Map[Long, String])
 case class WithMap(id: String, map_test: scala.collection.Map[Long, String])
+case class WithMapInOption(m: Option[scala.collection.Map[Int, Int]])

 case class Generic[T](id: T, value: Double)
spark git commit: [SPARK-19868] conflict TasksetManager lead to spark stopped
Repository: spark
Updated Branches: refs/heads/master d4fac410e -> 92e385e0b

[SPARK-19868] conflict TasksetManager lead to spark stopped

## What changes were proposed in this pull request?

We must set the taskset to zombie before the DAGScheduler handles the taskEnded event. It's possible the taskEnded event will cause the DAGScheduler to launch a new stage attempt (this happens when map output data was lost), and if this happens before the taskSet has been set to zombie, it will appear that we have conflicting task sets.

Author: liujianhui

Closes #17208 from liujianhuiouc/spark-19868.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/92e385e0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/92e385e0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/92e385e0

Branch: refs/heads/master
Commit: 92e385e0b55d70a48411e90aa0f2ed141c4d07c8
Parents: d4fac41
Author: liujianhui
Authored: Tue Mar 28 12:13:45 2017 -0700
Committer: Kay Ousterhout
Committed: Tue Mar 28 12:13:45 2017 -0700

--
 .../apache/spark/scheduler/TaskSetManager.scala  | 15 ++-
 .../spark/scheduler/TaskSetManagerSuite.scala    | 27 +++-
 2 files changed, 34 insertions(+), 8 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/92e385e0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
--
diff --git a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala b/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
index a177aab..a41b059 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
@@ -713,13 +713,7 @@ private[spark] class TaskSetManager(
       successfulTaskDurations.insert(info.duration)
     }
     removeRunningTask(tid)
-    // This method is called by "TaskSchedulerImpl.handleSuccessfulTask" which holds the
-    // "TaskSchedulerImpl" lock until exiting. To avoid the SPARK-7655 issue, we should not
-    // "deserialize" the value when holding a lock to avoid blocking other threads. So we call
-    // "result.value()" in "TaskResultGetter.enqueueSuccessfulTask" before reaching here.
-    // Note: "result.value()" only deserializes the value when it's called at the first time, so
-    // here "result.value()" just returns the value and won't block other threads.
-    sched.dagScheduler.taskEnded(tasks(index), Success, result.value(), result.accumUpdates, info)
+
     // Kill any other attempts for the same task (since those are unnecessary now that one
     // attempt completed successfully).
     for (attemptInfo <- taskAttempts(index) if attemptInfo.running) {
@@ -746,6 +740,13 @@ private[spark] class TaskSetManager(
       logInfo("Ignoring task-finished event for " + info.id + " in stage " + taskSet.id +
         " because task " + index + " has already completed successfully")
     }
+    // This method is called by "TaskSchedulerImpl.handleSuccessfulTask" which holds the
+    // "TaskSchedulerImpl" lock until exiting. To avoid the SPARK-7655 issue, we should not
+    // "deserialize" the value when holding a lock to avoid blocking other threads. So we call
+    // "result.value()" in "TaskResultGetter.enqueueSuccessfulTask" before reaching here.
+    // Note: "result.value()" only deserializes the value when it's called at the first time, so
+    // here "result.value()" just returns the value and won't block other threads.
+    sched.dagScheduler.taskEnded(tasks(index), Success, result.value(), result.accumUpdates, info)
     maybeFinishTaskSet()
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/92e385e0/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
--
diff --git a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
index 132caef..9ca6b8b 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
@@ -22,8 +22,10 @@ import java.util.{Properties, Random}

 import scala.collection.mutable
 import scala.collection.mutable.ArrayBuffer

-import org.mockito.Matchers.{anyInt, anyString}
+import org.mockito.Matchers.{any, anyInt, anyString}
 import org.mockito.Mockito.{mock, never, spy, verify, when}
+import org.mockito.invocation.InvocationOnMock
+import org.mockito.stubbing.Answer

 import org.apache.spark._
 import org.apache.spark.internal.config
@@ -1056,6 +1058,29 @@
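The fix is purely about ordering, which a stubbed outline makes easier to see (all helper names below are hypothetical stand-ins for the real TaskSetManager bookkeeping):

    def removeRunningTask(tid: Long): Unit = ()
    def killOtherAttempts(index: Int): Unit = ()  // the kill loop over taskAttempts(index)
    def recordSuccess(index: Int): Unit = ()      // successful(index) = true and zombie checks
    def notifyTaskEnded(index: Int): Unit = ()    // may synchronously resubmit a stage
    def maybeFinishTaskSet(): Unit = ()

    def handleSuccessfulTask(tid: Long, index: Int): Unit = {
      removeRunningTask(tid)
      killOtherAttempts(index)
      recordSuccess(index)
      // Only now notify the DAGScheduler: if taskEnded triggers a new stage
      // attempt, this task set is already marked finished/zombie, so the new
      // attempt no longer looks like a conflicting task set.
      notifyTaskEnded(index)
      maybeFinishTaskSet()
    }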
spark git commit: [SPARK-20125][SQL] Dataset of type option of map does not work
Repository: spark
Updated Branches: refs/heads/master 17eddb35a -> d4fac410e

[SPARK-20125][SQL] Dataset of type option of map does not work

## What changes were proposed in this pull request?

When we build the deserializer expression for map type, we will use `StaticInvoke` to call `ArrayBasedMapData.toScalaMap`, and declare the return type as `scala.collection.immutable.Map`. If the map is inside an Option, we will wrap this `StaticInvoke` with `WrapOption`, which requires the input to be `scala.collection.Map`. Ideally this should be fine, as `scala.collection.immutable.Map` extends `scala.collection.Map`, but our `ObjectType` is too strict about this; this PR fixes it.

## How was this patch tested?

new regression test

Author: Wenchen Fan

Closes #17454 from cloud-fan/map.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d4fac410
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d4fac410
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d4fac410

Branch: refs/heads/master
Commit: d4fac410e0554b7ccd44be44b7ce2fe07ed7f206
Parents: 17eddb3
Author: Wenchen Fan
Authored: Tue Mar 28 11:47:43 2017 -0700
Committer: Cheng Lian
Committed: Tue Mar 28 11:47:43 2017 -0700

--
 .../src/main/scala/org/apache/spark/sql/types/ObjectType.scala  | 5 +
 .../src/test/scala/org/apache/spark/sql/DatasetSuite.scala      | 6 ++
 2 files changed, 11 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/d4fac410/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala
index b18fba2..2d49fe0 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala
@@ -44,4 +44,9 @@ case class ObjectType(cls: Class[_]) extends DataType {
   def asNullable: DataType = this

   override def simpleString: String = cls.getName
+
+  override def acceptsType(other: DataType): Boolean = other match {
+    case ObjectType(otherCls) => cls.isAssignableFrom(otherCls)
+    case _ => false
+  }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/d4fac410/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
index 6417e7a..68e071a 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
@@ -1154,10 +1154,16 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
     assert(errMsg3.getMessage.startsWith("cannot have circular references in class, but got the " +
       "circular reference of class"))
   }
+
+  test("SPARK-20125: option of map") {
+    val ds = Seq(WithMapInOption(Some(Map(1 -> 1)))).toDS()
+    checkDataset(ds, WithMapInOption(Some(Map(1 -> 1))))
+  }
 }

 case class WithImmutableMap(id: String, map_test: scala.collection.immutable.Map[Long, String])
 case class WithMap(id: String, map_test: scala.collection.Map[Long, String])
+case class WithMapInOption(m: Option[scala.collection.Map[Int, Int]])

 case class Generic[T](id: T, value: Double)
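The `acceptsType` change swaps strict class equality for Java assignability, which is exactly what the Option-of-Map case needs. A quick stand-alone check (plain Scala, no Spark required):

    val expected = classOf[scala.collection.Map[_, _]]            // what WrapOption requires
    val produced = classOf[scala.collection.immutable.Map[_, _]]  // what toScalaMap declares

    expected.isAssignableFrom(produced)  // true: an immutable.Map is-a collection.Map
    produced.isAssignableFrom(expected)  // false: the reverse direction must not pass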
spark git commit: [SPARK-19995][YARN] Register tokens to current UGI to avoid re-issuing of tokens in yarn client mode
Repository: spark
Updated Branches: refs/heads/branch-2.1 4056191d3 -> 4bcb7d676

[SPARK-19995][YARN] Register tokens to current UGI to avoid re-issuing of tokens in yarn client mode

## What changes were proposed in this pull request?

In the current Spark on YARN code, we obtain tokens from the configured services but never add them to the current user's credentials, so all subsequent operations against these services still require a TGT rather than delegation tokens. This is unnecessary, since we already hold the tokens, and it also breaks the user-impersonation scenario, because the TGT is granted to the real user, not the proxy user. This change registers all obtained tokens with the current UGI, so that subsequent operations honor the tokens rather than the TGT, which also resolves the proxy-user issue above.

## How was this patch tested?

Verified locally in a secure cluster.

vanzin tgravescs mridulm dongjoon-hyun please help to review, thanks a lot.

Author: jerryshao

Closes #17335 from jerryshao/SPARK-19995.

(cherry picked from commit 17eddb35a280e77da7520343e0bf2a86b329ed62)
Signed-off-by: Marcelo Vanzin

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4bcb7d67
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4bcb7d67
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4bcb7d67

Branch: refs/heads/branch-2.1
Commit: 4bcb7d676440dedff737a10c98c308d8f2ed1c96
Parents: 4056191
Author: jerryshao
Authored: Tue Mar 28 10:41:11 2017 -0700
Committer: Marcelo Vanzin
Committed: Tue Mar 28 10:41:28 2017 -0700

--
 yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala | 3 +++
 1 file changed, 3 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/4bcb7d67/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
--
diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
index 5280c42..1ba736b 100644
--- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
+++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -403,6 +403,9 @@ private[spark] class Client(
     val nearestTimeOfNextRenewal = credentialManager.obtainCredentials(hadoopConf, credentials)

     if (credentials != null) {
+      // Add credentials to current user's UGI, so that following operations don't need to use the
+      // Kerberos tgt to get delegations again in the client side.
+      UserGroupInformation.getCurrentUser.addCredentials(credentials)
       logDebug(YarnSparkHadoopUtil.get.dumpTokens(credentials).mkString("\n"))
     }
spark git commit: [SPARK-19995][YARN] Register tokens to current UGI to avoid re-issuing of tokens in yarn client mode
Repository: spark
Updated Branches: refs/heads/master f82461fc1 -> 17eddb35a

[SPARK-19995][YARN] Register tokens to current UGI to avoid re-issuing of tokens in yarn client mode

## What changes were proposed in this pull request?

In the current Spark on YARN code, we obtain tokens from the configured services but never add them to the current user's credentials, so all subsequent operations against these services still require a TGT rather than delegation tokens. This is unnecessary, since we already hold the tokens, and it also breaks the user-impersonation scenario, because the TGT is granted to the real user, not the proxy user. This change registers all obtained tokens with the current UGI, so that subsequent operations honor the tokens rather than the TGT, which also resolves the proxy-user issue above.

## How was this patch tested?

Verified locally in a secure cluster.

vanzin tgravescs mridulm dongjoon-hyun please help to review, thanks a lot.

Author: jerryshao

Closes #17335 from jerryshao/SPARK-19995.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/17eddb35
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/17eddb35
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/17eddb35

Branch: refs/heads/master
Commit: 17eddb35a280e77da7520343e0bf2a86b329ed62
Parents: f82461f
Author: jerryshao
Authored: Tue Mar 28 10:41:11 2017 -0700
Committer: Marcelo Vanzin
Committed: Tue Mar 28 10:41:11 2017 -0700

--
 .../yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala | 3 +++
 1 file changed, 3 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/17eddb35/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
--
diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
index ccb0f8f..3218d22 100644
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -371,6 +371,9 @@ private[spark] class Client(
     val nearestTimeOfNextRenewal = credentialManager.obtainCredentials(hadoopConf, credentials)

     if (credentials != null) {
+      // Add credentials to current user's UGI, so that following operations don't need to use the
+      // Kerberos tgt to get delegations again in the client side.
+      UserGroupInformation.getCurrentUser.addCredentials(credentials)
       logDebug(YarnSparkHadoopUtil.get.dumpTokens(credentials).mkString("\n"))
     }
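For context, obtaining delegation tokens into a `Credentials` object has no effect on later Hadoop calls until the tokens are attached to the current UGI; that attachment is the whole fix. The shape of the call, with token acquisition elided:

    import org.apache.hadoop.security.{Credentials, UserGroupInformation}

    val credentials = new Credentials()
    // ... delegation tokens for HDFS, Hive, HBase, etc. are obtained into `credentials` ...
    UserGroupInformation.getCurrentUser.addCredentials(credentials)
    // Client-side calls to those services can now authenticate with the delegation
    // tokens instead of requiring the Kerberos TGT (which a proxy user lacks).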
spark git commit: [SPARK-20126][SQL] Remove HiveSessionState
Repository: spark
Updated Branches: refs/heads/master 4fcc214d9 -> f82461fc1

[SPARK-20126][SQL] Remove HiveSessionState

## What changes were proposed in this pull request?

Commit https://github.com/apache/spark/commit/ea361165e1ddce4d8aa0242ae3e878d7b39f1de2 moved most of the logic from the SessionState classes into an accompanying builder. This makes the existence of the `HiveSessionState` redundant. This PR removes the `HiveSessionState`.

## How was this patch tested?

Existing tests.

Author: Herman van Hovell

Closes #17457 from hvanhovell/SPARK-20126.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f82461fc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f82461fc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f82461fc

Branch: refs/heads/master
Commit: f82461fc1197f6055d9cf972d82260b178e10a7c
Parents: 4fcc214
Author: Herman van Hovell
Authored: Tue Mar 28 23:14:31 2017 +0800
Committer: Wenchen Fan
Committed: Tue Mar 28 23:14:31 2017 +0800

--
 .../spark/sql/execution/command/resources.scala  |   2 +-
 .../spark/sql/internal/SessionState.scala        |  47 +++---
 .../sql/internal/sessionStateBuilders.scala      |   8 +-
 .../sql/hive/thriftserver/SparkSQLEnv.scala      |  12 +-
 .../server/SparkSQLOperationManager.scala        |   6 +-
 .../hive/execution/HiveCompatibilitySuite.scala  |   2 +-
 .../org/apache/spark/sql/hive/HiveContext.scala  |   4 -
 .../spark/sql/hive/HiveMetastoreCatalog.scala    |   9 +-
 .../spark/sql/hive/HiveSessionState.scala        | 144 +++
 .../apache/spark/sql/hive/test/TestHive.scala    |  23 ++-
 .../sql/hive/HiveMetastoreCatalogSuite.scala     |   6 +-
 .../sql/hive/MetastoreDataSourcesSuite.scala     |   7 +-
 .../spark/sql/hive/execution/HiveDDLSuite.scala  |   6 +-
 .../apache/spark/sql/hive/parquetSuites.scala    |  21 +--
 14 files changed, 104 insertions(+), 193 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/f82461fc/sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala
index 20b0894..2e859cf 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala
@@ -37,7 +37,7 @@ case class AddJarCommand(path: String) extends RunnableCommand {
   }

   override def run(sparkSession: SparkSession): Seq[Row] = {
-    sparkSession.sessionState.addJar(path)
+    sparkSession.sessionState.resourceLoader.addJar(path)
     Seq(Row(0))
   }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/f82461fc/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala b/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala
index b5b0bb0..c6241d9 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala
@@ -63,6 +63,7 @@ private[sql] class SessionState(
     val optimizer: Optimizer,
     val planner: SparkPlanner,
     val streamingQueryManager: StreamingQueryManager,
+    val resourceLoader: SessionResourceLoader,
     createQueryExecution: LogicalPlan => QueryExecution,
     createClone: (SparkSession, SessionState) => SessionState) {

@@ -106,27 +107,6 @@ private[sql] class SessionState(
   def refreshTable(tableName: String): Unit = {
     catalog.refreshTable(sqlParser.parseTableIdentifier(tableName))
   }
-
-  /**
-   * Add a jar path to [[SparkContext]] and the classloader.
-   *
-   * Note: this method seems not access any session state, but the subclass `HiveSessionState` needs
-   * to add the jar to its hive client for the current session. Hence, it still needs to be in
-   * [[SessionState]].
-   */
-  def addJar(path: String): Unit = {
-    sparkContext.addJar(path)
-    val uri = new Path(path).toUri
-    val jarURL = if (uri.getScheme == null) {
-      // `path` is a local file path without a URL scheme
-      new File(path).toURI.toURL
-    } else {
-      // `path` is a URL with a scheme
-      uri.toURL
-    }
-    sharedState.jarClassLoader.addURL(jarURL)
-    Thread.currentThread().setContextClassLoader(sharedState.jarClassLoader)
-  }
 }

 private[sql] object SessionState {
@@ -160,10 +140,10 @@ class SessionStateBuilder(
 * Session shared
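The `resourceLoader.addJar` call in `AddJarCommand` above shows the replacement mechanism: instead of a `HiveSessionState` subclass overriding `addJar`, the Hive-specific behavior hangs off a pluggable loader. A rough sketch of that shape (simplified; the real classes carry more wiring):

    class SessionResourceLoader {
      def addJar(path: String): Unit = {
        // base behavior: register the jar with the SparkContext and the
        // shared jar classloader
      }
    }

    class HiveSessionResourceLoader extends SessionResourceLoader {
      override def addJar(path: String): Unit = {
        // additionally register the jar with the session's Hive client
        super.addJar(path)
      }
    }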
spark git commit: [SPARK-20124][SQL] Join reorder should keep the same order of final project attributes
Repository: spark
Updated Branches: refs/heads/master 91559d277 -> 4fcc214d9

[SPARK-20124][SQL] Join reorder should keep the same order of final project attributes

## What changes were proposed in this pull request?

The join reorder algorithm should keep exactly the same order of output attributes in the top project. For example, if the user selects a, b, c, then after reordering we should output a, b, c in the order specified by the user, not b, a, c or some other order.

## How was this patch tested?

A new test case is added in `JoinReorderSuite`.

Author: wangzhenhua

Closes #17453 from wzhfy/keepOrderInProject.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4fcc214d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4fcc214d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4fcc214d

Branch: refs/heads/master
Commit: 4fcc214d9eb5e98b2eed3e28cc23b0c511cd9007
Parents: 91559d2
Author: wangzhenhua
Authored: Tue Mar 28 22:22:38 2017 +0800
Committer: Wenchen Fan
Committed: Tue Mar 28 22:22:38 2017 +0800

--
 .../optimizer/CostBasedJoinReorder.scala       | 24 +---
 .../catalyst/optimizer/JoinReorderSuite.scala  | 13 +++
 .../spark/sql/catalyst/plans/PlanTest.scala    |  4 ++--
 3 files changed, 31 insertions(+), 10 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/4fcc214d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
index fc37720..cbd5064 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
@@ -40,10 +40,10 @@ case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with Pr
     val result = plan transformDown {
       // Start reordering with a joinable item, which is an InnerLike join with conditions.
       case j @ Join(_, _, _: InnerLike, Some(cond)) =>
-        reorder(j, j.outputSet)
+        reorder(j, j.output)
       case p @ Project(projectList, Join(_, _, _: InnerLike, Some(cond)))
         if projectList.forall(_.isInstanceOf[Attribute]) =>
-        reorder(p, p.outputSet)
+        reorder(p, p.output)
     }
     // After reordering is finished, convert OrderedJoin back to Join
     result transformDown {
@@ -52,7 +52,7 @@ case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with Pr
     }
   }

-  private def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+  private def reorder(plan: LogicalPlan, output: Seq[Attribute]): LogicalPlan = {
     val (items, conditions) = extractInnerJoins(plan)
     // TODO: Compute the set of star-joins and use them in the join enumeration
     // algorithm to prune un-optimal plan choices.
@@ -140,7 +140,7 @@ object JoinReorderDP extends PredicateHelper with Logging {
       conf: SQLConf,
       items: Seq[LogicalPlan],
       conditions: Set[Expression],
-      topOutput: AttributeSet): LogicalPlan = {
+      output: Seq[Attribute]): LogicalPlan = {

     val startTime = System.nanoTime()
     // Level i maintains all found plans for i + 1 items.
@@ -152,9 +152,10 @@ object JoinReorderDP extends PredicateHelper with Logging {
     // Build plans for next levels until the last level has only one plan. This plan contains
     // all items that can be joined, so there's no need to continue.
+    val topOutputSet = AttributeSet(output)
     while (foundPlans.size < items.length && foundPlans.last.size > 1) {
       // Build plans for the next level.
-      foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
+      foundPlans += searchLevel(foundPlans, conf, conditions, topOutputSet)
     }

     val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
@@ -163,7 +164,14 @@ object JoinReorderDP extends PredicateHelper with Logging {
     // The last level must have one and only one plan, because all items are joinable.
     assert(foundPlans.size == items.length && foundPlans.last.size == 1)
-    foundPlans.last.head._2.plan
+    foundPlans.last.head._2.plan match {
+      case p @ Project(projectList, j: Join) if projectList != output =>
+        assert(topOutputSet == p.outputSet)
+        // Keep the same order of final output attributes.
+        p.copy(projectList = output)
+      case finalPlan =>
+        finalPlan
+
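The invariant behind the fix is simple: reordering may permute the output attributes but never changes their set, so the original order can always be reimposed at the top. In miniature:

    val userOrder    = Seq("a", "b", "c")  // SELECT a, b, c FROM ...
    val afterReorder = Seq("b", "a", "c")  // same attributes, join-friendly order

    assert(userOrder.toSet == afterReorder.toSet)  // set equality always holds
    val finalOutput = userOrder                    // so re-projecting into the user's order is safe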
spark git commit: [SPARK-20094][SQL] Preventing push down of IN subquery to Join operator
Repository: spark
Updated Branches: refs/heads/master a9abff281 -> 91559d277

[SPARK-20094][SQL] Preventing push down of IN subquery to Join operator

## What changes were proposed in this pull request?

TPCDS q45 fails because `ReorderJoin` collects all predicates and tries to put them into the join condition when creating an ordered join. If a predicate with an IN subquery (`ListQuery`) is in a join condition instead of a filter condition, `RewritePredicateSubquery.rewriteExistentialExpr` fails to convert the subquery to an `ExistenceJoin`, resulting in an error. We should prevent push down of IN subqueries to the Join operator.

## How was this patch tested?

Added a new test case in `FilterPushdownSuite`.

Author: wangzhenhua

Closes #17428 from wzhfy/noSubqueryInJoinCond.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/91559d27
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/91559d27
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/91559d27

Branch: refs/heads/master
Commit: 91559d277f42ee83b79f5d8eb7ba037cf5c108da
Parents: a9abff2
Author: wangzhenhua
Authored: Tue Mar 28 13:43:23 2017 +0200
Committer: Herman van Hovell
Committed: Tue Mar 28 13:43:23 2017 +0200

--
 .../sql/catalyst/expressions/predicates.scala  |  6 ++
 .../optimizer/FilterPushdownSuite.scala        | 20
 2 files changed, 26 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/91559d27/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
index e5d1a1e..1235204 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
@@ -90,6 +90,12 @@ trait PredicateHelper {
    * Returns true iff `expr` could be evaluated as a condition within join.
    */
   protected def canEvaluateWithinJoin(expr: Expression): Boolean = expr match {
+    case l: ListQuery =>
+      // A ListQuery defines the query which we want to search in an IN subquery expression.
+      // Currently the only way to evaluate an IN subquery is to convert it to a
+      // LeftSemi/LeftAnti/ExistenceJoin by `RewritePredicateSubquery` rule.
+      // It cannot be evaluated as part of a Join operator.
+      false
     case e: SubqueryExpression =>
       // non-correlated subquery will be replaced as literal
       e.children.isEmpty

http://git-wip-us.apache.org/repos/asf/spark/blob/91559d27/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
--
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
index 6feea40..d846786 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
@@ -836,6 +836,26 @@ class FilterPushdownSuite extends PlanTest {
     comparePlans(optimized, answer)
   }

+  test("SPARK-20094: don't push predicate with IN subquery into join condition") {
+    val x = testRelation.subquery('x)
+    val z = testRelation.subquery('z)
+    val w = testRelation1.subquery('w)
+
+    val queryPlan = x
+      .join(z)
+      .where(("x.b".attr === "z.b".attr) &&
+        ("x.a".attr > 1 || "z.c".attr.in(ListQuery(w.select("w.d".attr)))))
+      .analyze
+
+    val expectedPlan = x
+      .join(z, Inner, Some("x.b".attr === "z.b".attr))
+      .where("x.a".attr > 1 || "z.c".attr.in(ListQuery(w.select("w.d".attr))))
+      .analyze
+
+    val optimized = Optimize.execute(queryPlan)
+    comparePlans(optimized, expectedPlan)
+  }
+
   test("Window: predicate push down -- basic") {
     val winExpr = windowExpr(count('b), windowSpec('a :: Nil, 'b.asc :: Nil, UnspecifiedFrame))
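The failing query shape, modeled on the new test case (table and column names come from that test; a SparkSession `spark` with temp views x, z and w is assumed):

    spark.sql(
      """SELECT *
        |FROM x JOIN z ON x.b = z.b
        |WHERE x.a > 1 OR z.c IN (SELECT d FROM w)
      """.stripMargin)

Because the IN-subquery disjunct mixes with a join predicate, ReorderJoin used to fold it into the join condition, where `RewritePredicateSubquery` can no longer rewrite the `ListQuery` into an ExistenceJoin; keeping it in the Filter above the join avoids the error.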
spark git commit: [SPARK-20119][TEST-MAVEN] Fix the test case fail in DataSourceScanExecRedactionSuite
Repository: spark
Updated Branches: refs/heads/master 6c70a38c2 -> a9abff281

[SPARK-20119][TEST-MAVEN] Fix the test case fail in DataSourceScanExecRedactionSuite

### What changes were proposed in this pull request?

Changed the pattern to match the first n characters in the location field so that the string truncation does not affect it.

### How was this patch tested?

N/A

Author: Xiao Li

Closes #17448 from gatorsmile/fixTestCAse.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a9abff28
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a9abff28
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a9abff28

Branch: refs/heads/master
Commit: a9abff281bcb15fdc9121c8bcb983a9d91cb
Parents: 6c70a38
Author: Xiao Li
Authored: Tue Mar 28 09:37:28 2017 +0200
Committer: Herman van Hovell
Committed: Tue Mar 28 09:37:28 2017 +0200

--
 .../spark/sql/execution/DataSourceScanExecRedactionSuite.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/a9abff28/sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala
index 986fa87..05a2b2c 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala
@@ -31,7 +31,7 @@ class DataSourceScanExecRedactionSuite extends QueryTest with SharedSQLContext {

   override def beforeAll(): Unit = {
     sparkConf.set("spark.redaction.string.regex",
-      "spark-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")
+      "file:/[\\w_]+")
     super.beforeAll()
   }
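The old pattern only redacts once a full spark-<uuid> directory name is visible, and the truncated strings in explain output can cut that UUID off mid-match; the new pattern needs only a short stable prefix. A quick comparison (the sample string is illustrative, not from the test):

    val oldRe = "spark-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}".r
    val newRe = "file:/[\\w_]+".r

    val truncated = "Location: InMemoryFileIndex[file:/private/var/spark-8c5e9f2b-41..."
    oldRe.findFirstIn(truncated)  // None: truncation cut the UUID, so nothing would be redacted
    newRe.findFirstIn(truncated)  // Some("file:/private"): the prefix still matches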