spark git commit: [SPARK-10670] [ML] [Doc] add api reference for ml doc
Repository: spark Updated Branches: refs/heads/master bf4199e26 -> 9b9fe5f7b [SPARK-10670] [ML] [Doc] add api reference for ml doc jira: https://issues.apache.org/jira/browse/SPARK-10670 In the Markdown docs for the spark.ml Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "Word2Vec" section in https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/ml-features.md This JIRA is just for spark.ml, not spark.mllib Author: Yuhao Yang Closes #8901 from hhbyyh/docAPI. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9b9fe5f7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9b9fe5f7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9b9fe5f7 Branch: refs/heads/master Commit: 9b9fe5f7bf55257269d8febcd64e95677075dfb6 Parents: bf4199e Author: Yuhao Yang Authored: Mon Sep 28 22:40:02 2015 -0700 Committer: Xiangrui Meng Committed: Mon Sep 28 22:40:02 2015 -0700 -- docs/ml-features.md | 259 +++ 1 file changed, 195 insertions(+), 64 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9b9fe5f7/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index b70da4a..44a9882 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -28,12 +28,15 @@ The algorithm combines Term Frequency (TF) counts with the [hashing trick](http: **IDF**: `IDF` is an `Estimator` which fits on a dataset and produces an `IDFModel`. The `IDFModel` takes feature vectors (generally created from `HashingTF`) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus. Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for more details on Term Frequency and Inverse Document Frequency. -For API details, refer to the [HashingTF API docs](api/scala/index.html#org.apache.spark.ml.feature.HashingTF) and the [IDF API docs](api/scala/index.html#org.apache.spark.ml.feature.IDF). In the following code segment, we start with a set of sentences. We split each sentence into words using `Tokenizer`. For each sentence (bag of words), we use `HashingTF` to hash the sentence into a feature vector. We use `IDF` to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm. + +Refer to the [HashingTF Scala docs](api/scala/index.html#org.apache.spark.ml.feature.HashingTF) and +the [IDF Scala docs](api/scala/index.html#org.apache.spark.ml.feature.IDF) for more details on the API. + {% highlight scala %} import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer} @@ -54,6 +57,10 @@ rescaledData.select("features", "label").take(3).foreach(println) + +Refer to the [HashingTF Java docs](api/java/org/apache/spark/ml/feature/HashingTF.html) and the +[IDF Java docs](api/java/org/apache/spark/ml/feature/IDF.html) for more details on the API. + {% highlight java %} import java.util.Arrays; @@ -100,6 +107,10 @@ for (Row r : rescaledData.select("features", "label").take(3)) { + +Refer to the [HashingTF Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.HashingTF) and +the [IDF Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.IDF) for more details on the API. 
+ {% highlight python %} from pyspark.ml.feature import HashingTF, IDF, Tokenizer @@ -267,9 +278,11 @@ each vector represents the token counts of the document over the vocabulary. -More details can be found in the API docs for -[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and -[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel). + +Refer to the [CountVectorizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) +and the [CountVectorizerModel Scala docs](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel) +for more details on the API. + {% highlight scala %} import org.apache.spark.ml.feature.CountVectorizer import org.apache.spark.mllib.util.CountVectorizerModel @@ -297,9 +310,11 @@ cvModel.transform(df).select("features").show() -More details can be found in the API docs for -[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) and -[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html). + +Refer to the [CountVectorizer Java docs](api/java/org/apache/spark/ml/feature/CountVectorize
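For readers without the full diff handy, here is a minimal, self-contained Scala sketch of the TF-IDF pipeline those codetabs document (Tokenizer -> HashingTF -> IDF), assuming a local master and the Spark 1.5-era spark.ml API; the toy sentences, app name, and `local[2]` master are illustrative choices, not part of the commit.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SQLContext

object TfIdfSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TfIdfSketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // A toy corpus of (label, sentence) pairs.
    val sentenceData = sqlContext.createDataFrame(Seq(
      (0, "Hi I heard about Spark"),
      (0, "I wish Java could use case classes"),
      (1, "Logistic regression models are neat")
    )).toDF("label", "sentence")

    // Split each sentence into words.
    val wordsData = new Tokenizer()
      .setInputCol("sentence").setOutputCol("words")
      .transform(sentenceData)

    // Hash each bag of words into a fixed-size term-frequency vector.
    val featurized = new HashingTF()
      .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
      .transform(wordsData)

    // Rescale term frequencies by inverse document frequency.
    val rescaled = new IDF()
      .setInputCol("rawFeatures").setOutputCol("features")
      .fit(featurized)
      .transform(featurized)

    rescaled.select("features", "label").take(3).foreach(println)
    sc.stop()
  }
}
```

The Java and Python tabs walk through the same three steps, which is why the commit adds a language-specific API link above each one.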
[2/2] spark git commit: [SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE
[SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to light that the guidance at http://www.apache.org/dev/licensing-howto.html#permissive-deps means that permissively-licensed dependencies has a different interpretation than we (er, I) had been operating under. "pointer ... to the license within the source tree" specifically means a copy of the license within Spark's distribution, whereas at the moment, Spark's LICENSE has a pointer to the project's license in the other project's source tree. The remedy is simply to inline all such license references (i.e. BSD/MIT licenses) or include their text in "licenses" subdirectory and point to that. Along the way, we can also treat other BSD/MIT licenses, whose text has been inlined into LICENSE, in the same way. The LICENSE file can continue to provide a helpful list of BSD/MIT licensed projects and a pointer to their sites. This would be over and above including license text in the distro, which is the essential thing. Author: Sean Owen Closes #8919 from srowen/SPARK-10833. (cherry picked from commit bf4199e261c3c8dd2970e2a154c97b46fb339f02) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9b3014bc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9b3014bc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9b3014bc Branch: refs/heads/branch-1.5 Commit: 9b3014bc4e0dd2bcfbd7e42af83753393b74a760 Parents: a367840 Author: Sean Owen Authored: Mon Sep 28 22:56:43 2015 -0400 Committer: Sean Owen Committed: Mon Sep 28 22:56:59 2015 -0400 -- LICENSE | 699 +-- NOTICE | 35 + .../apache/spark/util/collection/TimSort.java | 18 + licenses/LICENSE-AnchorJS.txt | 21 + licenses/LICENSE-DPark.txt | 30 + licenses/LICENSE-Mockito.txt| 21 + licenses/LICENSE-SnapTree.txt | 35 + licenses/LICENSE-antlr.txt | 8 + licenses/LICENSE-boto.txt | 20 + licenses/LICENSE-cloudpickle.txt| 28 + licenses/LICENSE-d3.min.js.txt | 26 + licenses/LICENSE-dagre-d3.txt | 19 + licenses/LICENSE-f2j.txt| 8 + licenses/LICENSE-graphlib-dot.txt | 19 + licenses/LICENSE-heapq.txt | 280 licenses/LICENSE-javolution.txt | 27 + licenses/LICENSE-jbcrypt.txt| 17 + licenses/LICENSE-jblas.txt | 31 + licenses/LICENSE-jline.txt | 32 + licenses/LICENSE-jpmml-model.txt| 10 + licenses/LICENSE-jquery.txt | 9 + licenses/LICENSE-junit-interface.txt| 24 + licenses/LICENSE-kryo.txt | 10 + licenses/LICENSE-minlog.txt | 10 + licenses/LICENSE-netlib.txt | 49 ++ licenses/LICENSE-paranamer.txt | 28 + licenses/LICENSE-protobuf.txt | 42 ++ licenses/LICENSE-py4j.txt | 27 + licenses/LICENSE-pyrolite.txt | 28 + licenses/LICENSE-reflectasm.txt | 10 + licenses/LICENSE-sbt-launch-lib.txt | 26 + licenses/LICENSE-scala.txt | 30 + licenses/LICENSE-scalacheck.txt | 32 + licenses/LICENSE-scopt.txt | 21 + licenses/LICENSE-slf4j.txt | 21 + licenses/LICENSE-sorttable.js.txt | 16 + licenses/LICENSE-spire.txt | 19 + licenses/LICENSE-xmlenc.txt | 27 + make-distribution.sh| 1 + .../spark/network/util/LimitedInputStream.java | 18 + 40 files changed, 1153 insertions(+), 679 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9b3014bc/LICENSE -- diff --git a/LICENSE b/LICENSE index f9e412c..dca03ab 100644 --- a/LICENSE +++ b/LICENSE @@ -211,712 +211,45 @@ subcomponents is subject to the terms and conditions of the following licenses. 
-=== -For the Boto EC2 library (ec2/third_party/boto*.zip): -=== - -Copyright (c) 2006-2008 Mitch Garnaat http://garnaat.org/ - -Permission is hereby granted, free of charge, to any person obtaining a -copy of this software and associated documentation files (the -"Software"), to deal in the Software
[1/2] spark git commit: [SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE
Repository: spark Updated Branches: refs/heads/branch-1.5 a36784083 -> 9b3014bc4 http://git-wip-us.apache.org/repos/asf/spark/blob/9b3014bc/licenses/LICENSE-jpmml-model.txt -- diff --git a/licenses/LICENSE-jpmml-model.txt b/licenses/LICENSE-jpmml-model.txt new file mode 100644 index 000..69411d1 --- /dev/null +++ b/licenses/LICENSE-jpmml-model.txt @@ -0,0 +1,10 @@ +Copyright (c) 2009, University of Tartu +All rights reserved. + +Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: + +1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. +2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. +3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/spark/blob/9b3014bc/licenses/LICENSE-jquery.txt -- diff --git a/licenses/LICENSE-jquery.txt b/licenses/LICENSE-jquery.txt new file mode 100644 index 000..e1dd696 --- /dev/null +++ b/licenses/LICENSE-jquery.txt @@ -0,0 +1,9 @@ +The MIT License (MIT) + +Copyright (c) + +Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
\ No newline at end of file http://git-wip-us.apache.org/repos/asf/spark/blob/9b3014bc/licenses/LICENSE-junit-interface.txt -- diff --git a/licenses/LICENSE-junit-interface.txt b/licenses/LICENSE-junit-interface.txt new file mode 100644 index 000..e835350 --- /dev/null +++ b/licenses/LICENSE-junit-interface.txt @@ -0,0 +1,24 @@ +Copyright (c) 2009-2012, Stefan Zeiger +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + +* Redistributions of source code must retain the above copyright notice, + this list of conditions and the following disclaimer. + +* Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIAB
[1/2] spark git commit: [SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE
Repository: spark Updated Branches: refs/heads/master ea02e5513 -> bf4199e26 http://git-wip-us.apache.org/repos/asf/spark/blob/bf4199e2/licenses/LICENSE-jquery.txt -- diff --git a/licenses/LICENSE-jquery.txt b/licenses/LICENSE-jquery.txt new file mode 100644 index 000..e1dd696 --- /dev/null +++ b/licenses/LICENSE-jquery.txt @@ -0,0 +1,9 @@ +The MIT License (MIT) + +Copyright (c) + +Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/spark/blob/bf4199e2/licenses/LICENSE-junit-interface.txt -- diff --git a/licenses/LICENSE-junit-interface.txt b/licenses/LICENSE-junit-interface.txt new file mode 100644 index 000..e835350 --- /dev/null +++ b/licenses/LICENSE-junit-interface.txt @@ -0,0 +1,24 @@ +Copyright (c) 2009-2012, Stefan Zeiger +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + +* Redistributions of source code must retain the above copyright notice, + this list of conditions and the following disclaimer. + +* Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/spark/blob/bf4199e2/licenses/LICENSE-kryo.txt -- diff --git a/licenses/LICENSE-kryo.txt b/licenses/LICENSE-kryo.txt new file mode 100644 index 000..3f6a160 --- /dev/null +++ b/licenses/LICENSE-kryo.txt @@ -0,0 +1,10 @@ +Copyright (c) 2008, Nathan Sweet +All rights reserved. 
+ +Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: + +* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. +* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. +* Neither the name of Esoteric Software nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIG
[2/2] spark git commit: [SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE
[SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to light that the guidance at http://www.apache.org/dev/licensing-howto.html#permissive-deps means that permissively-licensed dependencies has a different interpretation than we (er, I) had been operating under. "pointer ... to the license within the source tree" specifically means a copy of the license within Spark's distribution, whereas at the moment, Spark's LICENSE has a pointer to the project's license in the other project's source tree. The remedy is simply to inline all such license references (i.e. BSD/MIT licenses) or include their text in "licenses" subdirectory and point to that. Along the way, we can also treat other BSD/MIT licenses, whose text has been inlined into LICENSE, in the same way. The LICENSE file can continue to provide a helpful list of BSD/MIT licensed projects and a pointer to their sites. This would be over and above including license text in the distro, which is the essential thing. Author: Sean Owen Closes #8919 from srowen/SPARK-10833. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bf4199e2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bf4199e2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bf4199e2 Branch: refs/heads/master Commit: bf4199e261c3c8dd2970e2a154c97b46fb339f02 Parents: ea02e55 Author: Sean Owen Authored: Mon Sep 28 22:56:43 2015 -0400 Committer: Sean Owen Committed: Mon Sep 28 22:56:43 2015 -0400 -- LICENSE | 699 +-- NOTICE | 35 + .../apache/spark/util/collection/TimSort.java | 18 + licenses/LICENSE-AnchorJS.txt | 21 + licenses/LICENSE-DPark.txt | 30 + licenses/LICENSE-Mockito.txt| 21 + licenses/LICENSE-SnapTree.txt | 35 + licenses/LICENSE-antlr.txt | 8 + licenses/LICENSE-boto.txt | 20 + licenses/LICENSE-cloudpickle.txt| 28 + licenses/LICENSE-d3.min.js.txt | 26 + licenses/LICENSE-dagre-d3.txt | 19 + licenses/LICENSE-f2j.txt| 8 + licenses/LICENSE-graphlib-dot.txt | 19 + licenses/LICENSE-heapq.txt | 280 licenses/LICENSE-javolution.txt | 27 + licenses/LICENSE-jbcrypt.txt| 17 + licenses/LICENSE-jblas.txt | 31 + licenses/LICENSE-jline.txt | 32 + licenses/LICENSE-jpmml-model.txt| 10 + licenses/LICENSE-jquery.txt | 9 + licenses/LICENSE-junit-interface.txt| 24 + licenses/LICENSE-kryo.txt | 10 + licenses/LICENSE-minlog.txt | 10 + licenses/LICENSE-netlib.txt | 49 ++ licenses/LICENSE-paranamer.txt | 28 + licenses/LICENSE-protobuf.txt | 42 ++ licenses/LICENSE-py4j.txt | 27 + licenses/LICENSE-pyrolite.txt | 28 + licenses/LICENSE-reflectasm.txt | 10 + licenses/LICENSE-sbt-launch-lib.txt | 26 + licenses/LICENSE-scala.txt | 30 + licenses/LICENSE-scalacheck.txt | 32 + licenses/LICENSE-scopt.txt | 21 + licenses/LICENSE-slf4j.txt | 21 + licenses/LICENSE-sorttable.js.txt | 16 + licenses/LICENSE-spire.txt | 19 + licenses/LICENSE-xmlenc.txt | 27 + make-distribution.sh| 1 + .../spark/network/util/LimitedInputStream.java | 18 + 40 files changed, 1153 insertions(+), 679 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bf4199e2/LICENSE -- diff --git a/LICENSE b/LICENSE index f9e412c..dca03ab 100644 --- a/LICENSE +++ b/LICENSE @@ -211,712 +211,45 @@ subcomponents is subject to the terms and conditions of the following licenses. 
-=== -For the Boto EC2 library (ec2/third_party/boto*.zip): -=== - -Copyright (c) 2006-2008 Mitch Garnaat http://garnaat.org/ - -Permission is hereby granted, free of charge, to any person obtaining a -copy of this software and associated documentation files (the -"Software"), to deal in the Software without restriction, including -without limitation the rights to use, copy, modify, merge, publish,
spark git commit: [SPARK-10859] [SQL] fix stats of StringType in columnar cache
Repository: spark Updated Branches: refs/heads/branch-1.5 de259316b -> a36784083 [SPARK-10859] [SQL] fix stats of StringType in columnar cache The UTF8String may come from an UnsafeRow, in which case its underlying buffer is not copied, so we should clone it in order to hold it in the stats. cc yhuai Author: Davies Liu Closes #8929 from davies/pushdown_string. (cherry picked from commit ea02e5513a8f9853094d5612c962fc8c1a340f50) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a3678408 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a3678408 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a3678408 Branch: refs/heads/branch-1.5 Commit: a367840834b97cd6a9ecda568bb21ee6dc35fcde Parents: de25931 Author: Davies Liu Authored: Mon Sep 28 14:40:40 2015 -0700 Committer: Yin Huai Committed: Mon Sep 28 14:40:52 2015 -0700 -- .../scala/org/apache/spark/sql/columnar/ColumnStats.scala | 4 ++-- .../spark/sql/columnar/InMemoryColumnarQuerySuite.scala | 7 +++ 2 files changed, 9 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a3678408/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala b/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala index 5cbd52b..fbd51b7 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala @@ -213,8 +213,8 @@ private[sql] class StringColumnStats extends ColumnStats { super.gatherStats(row, ordinal) if (!row.isNullAt(ordinal)) { val value = row.getUTF8String(ordinal) - if (upper == null || value.compareTo(upper) > 0) upper = value - if (lower == null || value.compareTo(lower) < 0) lower = value + if (upper == null || value.compareTo(upper) > 0) upper = value.clone() + if (lower == null || value.compareTo(lower) < 0) lower = value.clone() sizeInBytes += STRING.actualSize(row, ordinal) } } http://git-wip-us.apache.org/repos/asf/spark/blob/a3678408/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala index 83db9b6..3a0f346 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala @@ -211,4 +211,11 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { // Drop the cache. cached.unpersist() } + + test("SPARK-10859: Predicates pushed to InMemoryColumnarTableScan are not evaluated correctly") { +val data = sqlContext.range(10).selectExpr("id", "cast(id as string) as s") +data.cache() +assert(data.count() === 10) +assert(data.filter($"s" === "3").count() === 1) + } }
spark git commit: [SPARK-10859] [SQL] fix stats of StringType in columnar cache
Repository: spark Updated Branches: refs/heads/master 14978b785 -> ea02e5513 [SPARK-10859] [SQL] fix stats of StringType in columnar cache The UTF8String may come from an UnsafeRow, in which case its underlying buffer is not copied, so we should clone it in order to hold it in the stats. cc yhuai Author: Davies Liu Closes #8929 from davies/pushdown_string. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ea02e551 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ea02e551 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ea02e551 Branch: refs/heads/master Commit: ea02e5513a8f9853094d5612c962fc8c1a340f50 Parents: 14978b7 Author: Davies Liu Authored: Mon Sep 28 14:40:40 2015 -0700 Committer: Yin Huai Committed: Mon Sep 28 14:40:40 2015 -0700 -- .../scala/org/apache/spark/sql/columnar/ColumnStats.scala | 4 ++-- .../spark/sql/columnar/InMemoryColumnarQuerySuite.scala | 7 +++ 2 files changed, 9 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ea02e551/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala b/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala index 5cbd52b..fbd51b7 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala @@ -213,8 +213,8 @@ private[sql] class StringColumnStats extends ColumnStats { super.gatherStats(row, ordinal) if (!row.isNullAt(ordinal)) { val value = row.getUTF8String(ordinal) - if (upper == null || value.compareTo(upper) > 0) upper = value - if (lower == null || value.compareTo(lower) < 0) lower = value + if (upper == null || value.compareTo(upper) > 0) upper = value.clone() + if (lower == null || value.compareTo(lower) < 0) lower = value.clone() sizeInBytes += STRING.actualSize(row, ordinal) } } http://git-wip-us.apache.org/repos/asf/spark/blob/ea02e551/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala index cd3644e..ea5dd2b 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala @@ -212,4 +212,11 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { // Drop the cache. cached.unpersist() } + + test("SPARK-10859: Predicates pushed to InMemoryColumnarTableScan are not evaluated correctly") { +val data = sqlContext.range(10).selectExpr("id", "cast(id as string) as s") +data.cache() +assert(data.count() === 10) +assert(data.filter($"s" === "3").count() === 1) + } }
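The root cause is easier to see outside Spark's internals: a `UTF8String` obtained from an `UnsafeRow` can point into a buffer that the scan reuses for the next row, so keeping it around as a running min/max lets its contents change underneath the stats. The following contrived, self-contained Scala sketch (a stand-in, not the actual `UTF8String`/`UnsafeRow` classes) reproduces that aliasing pitfall and the clone-before-store remedy the patch applies.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Contrived stand-in for a UTF8String that aliases a reusable row buffer.
final class BufferBackedString(buf: Array[Byte], offset: Int, len: Int) {
  override def toString: String = new String(buf, offset, len, UTF_8)
  def compareTo(other: BufferBackedString): Int = toString.compareTo(other.toString)
  // Defensive copy that no longer aliases the shared buffer.
  def copy(): BufferBackedString = {
    val bytes = java.util.Arrays.copyOfRange(buf, offset, offset + len)
    new BufferBackedString(bytes, 0, bytes.length)
  }
}

object StatsAliasingDemo {
  def main(args: Array[String]): Unit = {
    val rowBuffer = new Array[Byte](3)    // reused for every "row"
    var upper: BufferBackedString = null  // running max, like StringColumnStats.upper

    for (s <- Seq("abc", "zzz", "def")) {
      s.getBytes(UTF_8).copyToArray(rowBuffer)            // the next row overwrites the buffer
      val value = new BufferBackedString(rowBuffer, 0, 3)
      // Without .copy(), `upper` would keep pointing at rowBuffer and end up reading "def".
      if (upper == null || value.compareTo(upper) > 0) upper = value.copy()
    }
    println(s"max string seen: $upper")  // prints "zzz"
  }
}
```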
spark git commit: [SPARK-10395] [SQL] Simplifies CatalystReadSupport
Repository: spark Updated Branches: refs/heads/master 353c30bd7 -> 14978b785 [SPARK-10395] [SQL] Simplifies CatalystReadSupport Please refer to [SPARK-10395] [1] for details. [1]: https://issues.apache.org/jira/browse/SPARK-10395 Author: Cheng Lian Closes #8553 from liancheng/spark-10395/simplify-parquet-read-support. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/14978b78 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/14978b78 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/14978b78 Branch: refs/heads/master Commit: 14978b785a43e0c13c8bdfd52d20cc8984984ba3 Parents: 353c30b Author: Cheng Lian Authored: Mon Sep 28 13:53:45 2015 -0700 Committer: Davies Liu Committed: Mon Sep 28 13:53:45 2015 -0700 -- .../parquet/CatalystReadSupport.scala | 92 ++-- 1 file changed, 45 insertions(+), 47 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/14978b78/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala index 8c819f1..9502b83 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala @@ -19,7 +19,7 @@ package org.apache.spark.sql.execution.datasources.parquet import java.util.{Map => JMap} -import scala.collection.JavaConverters.{collectionAsScalaIterableConverter, mapAsJavaMapConverter, mapAsScalaMapConverter} +import scala.collection.JavaConverters._ import org.apache.hadoop.conf.Configuration import org.apache.parquet.hadoop.api.ReadSupport.ReadContext @@ -29,34 +29,62 @@ import org.apache.parquet.schema.Type.Repetition import org.apache.parquet.schema._ import org.apache.spark.Logging +import org.apache.spark.deploy.SparkHadoopUtil import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.types._ +/** + * A Parquet [[ReadSupport]] implementation for reading Parquet records as Catalyst + * [[InternalRow]]s. + * + * The API interface of [[ReadSupport]] is a little bit over complicated because of historical + * reasons. In older versions of parquet-mr (say 1.6.0rc3 and prior), [[ReadSupport]] need to be + * instantiated and initialized twice on both driver side and executor side. The [[init()]] method + * is for driver side initialization, while [[prepareForRead()]] is for executor side. However, + * starting from parquet-mr 1.6.0, it's no longer the case, and [[ReadSupport]] is only instantiated + * and initialized on executor side. So, theoretically, now it's totally fine to combine these two + * methods into a single initialization method. The only reason (I could think of) to still have + * them here is for parquet-mr API backwards-compatibility. + * + * Due to this reason, we no longer rely on [[ReadContext]] to pass requested schema from [[init()]] + * to [[prepareForRead()]], but use a private `var` for simplicity. + */ private[parquet] class CatalystReadSupport extends ReadSupport[InternalRow] with Logging { - // Called after `init()` when initializing Parquet record reader. + private var catalystRequestedSchema: StructType = _ + + /** + * Called on executor side before [[prepareForRead()]] and instantiating actual Parquet record + * readers. 
Responsible for figuring out Parquet requested schema used for column pruning. + */ + override def init(context: InitContext): ReadContext = { +catalystRequestedSchema = { + // scalastyle:off jobcontext + val conf = context.getConfiguration + // scalastyle:on jobcontext + val schemaString = conf.get(CatalystReadSupport.SPARK_ROW_REQUESTED_SCHEMA) + assert(schemaString != null, "Parquet requested schema not set.") + StructType.fromString(schemaString) +} + +val parquetRequestedSchema = + CatalystReadSupport.clipParquetSchema(context.getFileSchema, catalystRequestedSchema) + +new ReadContext(parquetRequestedSchema, Map.empty[String, String].asJava) + } + + /** + * Called on executor side after [[init()]], before instantiating actual Parquet record readers. + * Responsible for instantiating [[RecordMaterializer]], which is used for converting Parquet + * records to Catalyst [[InternalRow]]s. + */ override def prepareForRead( conf: Configuration, keyValueMetaData: JMap[String, String], fileSchema: MessageType, readContext: ReadContext): Recor
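One practical detail behind the new scaladoc: the pruned Catalyst schema reaches the read path as a JSON string stored in the Hadoop configuration (the real constant lives in `CatalystReadSupport.SPARK_ROW_REQUESTED_SCHEMA`). The sketch below only illustrates that round trip under a made-up key name; it is not the commit's code.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.types._

object RequestedSchemaRoundTrip {
  // Hypothetical key name, for illustration only.
  val RequestedSchemaKey = "example.parquet.row.requested_schema"

  def main(args: Array[String]): Unit = {
    val requested = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("s", StringType)))

    // "Driver side": serialize the pruned schema into the job configuration.
    val conf = new Configuration()
    conf.set(RequestedSchemaKey, requested.json)

    // "Executor side": parse it back before building the record materializer.
    val restored = DataType.fromJson(conf.get(RequestedSchemaKey)).asInstanceOf[StructType]
    assert(restored == requested)
    println(restored)
  }
}
```

This is also why the `assert(schemaString != null, ...)` in `init()` matters: if the driver never set the key, failing fast beats silently reading the wrong columns.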
spark git commit: [SPARK-10790] [YARN] Fix initial executor number not set issue and consolidate the codes
Repository: spark Updated Branches: refs/heads/branch-1.5 e0c3212a9 -> de259316b [SPARK-10790] [YARN] Fix initial executor number not set issue and consolidate the codes This bug is introduced in [SPARK-9092](https://issues.apache.org/jira/browse/SPARK-9092), `targetExecutorNumber` should use `minExecutors` if `initialExecutors` is not set. Using 0 instead will meet the problem as mentioned in [SPARK-10790](https://issues.apache.org/jira/browse/SPARK-10790). Also consolidate and simplify some similar code snippets to keep the consistent semantics. Author: jerryshao Closes #8910 from jerryshao/SPARK-10790. (cherry picked from commit 353c30bd7dfbd3b76fc8bc9a6dfab9321439a34b) Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/de259316 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/de259316 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/de259316 Branch: refs/heads/branch-1.5 Commit: de259316b491762dbcffd1667b669f909125dd13 Parents: e0c3212 Author: jerryshao Authored: Mon Sep 28 06:38:54 2015 -0700 Committer: Marcelo Vanzin Committed: Mon Sep 28 06:39:13 2015 -0700 -- .../spark/deploy/yarn/ClientArguments.scala | 20 + .../spark/deploy/yarn/YarnAllocator.scala | 6 + .../spark/deploy/yarn/YarnSparkHadoopUtil.scala | 23 .../cluster/YarnClusterSchedulerBackend.scala | 18 ++- 4 files changed, 27 insertions(+), 40 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/de259316/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala index 54f62e6..1165061 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala @@ -81,25 +81,7 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf) .orNull // If dynamic allocation is enabled, start at the configured initial number of executors. // Default to minExecutors if no initialExecutors is set. 
-if (isDynamicAllocationEnabled) { - val minExecutorsConf = "spark.dynamicAllocation.minExecutors" - val initialExecutorsConf = "spark.dynamicAllocation.initialExecutors" - val maxExecutorsConf = "spark.dynamicAllocation.maxExecutors" - val minNumExecutors = sparkConf.getInt(minExecutorsConf, 0) - val initialNumExecutors = sparkConf.getInt(initialExecutorsConf, minNumExecutors) - val maxNumExecutors = sparkConf.getInt(maxExecutorsConf, Integer.MAX_VALUE) - - // If defined, initial executors must be between min and max - if (initialNumExecutors < minNumExecutors || initialNumExecutors > maxNumExecutors) { -throw new IllegalArgumentException( - s"$initialExecutorsConf must be between $minExecutorsConf and $maxNumExecutors!") - } - - numExecutors = initialNumExecutors -} else { - val numExecutorsConf = "spark.executor.instances" - numExecutors = sparkConf.getInt(numExecutorsConf, numExecutors) -} +numExecutors = YarnSparkHadoopUtil.getInitialTargetExecutorNumber(sparkConf) principal = Option(principal) .orElse(sparkConf.getOption("spark.yarn.principal")) .orNull http://git-wip-us.apache.org/repos/asf/spark/blob/de259316/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala index ccf753e..6a02848 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala @@ -89,11 +89,7 @@ private[yarn] class YarnAllocator( @volatile private var numExecutorsFailed = 0 @volatile private var targetNumExecutors = -if (Utils.isDynamicAllocationEnabled(sparkConf)) { - sparkConf.getInt("spark.dynamicAllocation.initialExecutors", 0) -} else { - sparkConf.getInt("spark.executor.instances", YarnSparkHadoopUtil.DEFAULT_NUMBER_EXECUTORS) -} +YarnSparkHadoopUtil.getInitialTargetExecutorNumber(sparkConf) // Keep track of which container is running which executor to remove the executors later // Visible for testing. http://git-wip-us.apache.org/repos/asf/spark/blob/de259316/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala ---
spark git commit: [SPARK-10790] [YARN] Fix initial executor number not set issue and consolidate the codes
Repository: spark Updated Branches: refs/heads/master d8d50ed38 -> 353c30bd7 [SPARK-10790] [YARN] Fix initial executor number not set issue and consolidate the codes This bug is introduced in [SPARK-9092](https://issues.apache.org/jira/browse/SPARK-9092), `targetExecutorNumber` should use `minExecutors` if `initialExecutors` is not set. Using 0 instead will meet the problem as mentioned in [SPARK-10790](https://issues.apache.org/jira/browse/SPARK-10790). Also consolidate and simplify some similar code snippets to keep the consistent semantics. Author: jerryshao Closes #8910 from jerryshao/SPARK-10790. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/353c30bd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/353c30bd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/353c30bd Branch: refs/heads/master Commit: 353c30bd7dfbd3b76fc8bc9a6dfab9321439a34b Parents: d8d50ed Author: jerryshao Authored: Mon Sep 28 06:38:54 2015 -0700 Committer: Marcelo Vanzin Committed: Mon Sep 28 06:38:54 2015 -0700 -- .../spark/deploy/yarn/ClientArguments.scala | 20 + .../spark/deploy/yarn/YarnAllocator.scala | 6 + .../spark/deploy/yarn/YarnSparkHadoopUtil.scala | 23 .../cluster/YarnClusterSchedulerBackend.scala | 18 ++- 4 files changed, 27 insertions(+), 40 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/353c30bd/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala index 54f62e6..1165061 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala @@ -81,25 +81,7 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf) .orNull // If dynamic allocation is enabled, start at the configured initial number of executors. // Default to minExecutors if no initialExecutors is set. 
-if (isDynamicAllocationEnabled) { - val minExecutorsConf = "spark.dynamicAllocation.minExecutors" - val initialExecutorsConf = "spark.dynamicAllocation.initialExecutors" - val maxExecutorsConf = "spark.dynamicAllocation.maxExecutors" - val minNumExecutors = sparkConf.getInt(minExecutorsConf, 0) - val initialNumExecutors = sparkConf.getInt(initialExecutorsConf, minNumExecutors) - val maxNumExecutors = sparkConf.getInt(maxExecutorsConf, Integer.MAX_VALUE) - - // If defined, initial executors must be between min and max - if (initialNumExecutors < minNumExecutors || initialNumExecutors > maxNumExecutors) { -throw new IllegalArgumentException( - s"$initialExecutorsConf must be between $minExecutorsConf and $maxNumExecutors!") - } - - numExecutors = initialNumExecutors -} else { - val numExecutorsConf = "spark.executor.instances" - numExecutors = sparkConf.getInt(numExecutorsConf, numExecutors) -} +numExecutors = YarnSparkHadoopUtil.getInitialTargetExecutorNumber(sparkConf) principal = Option(principal) .orElse(sparkConf.getOption("spark.yarn.principal")) .orNull http://git-wip-us.apache.org/repos/asf/spark/blob/353c30bd/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala index fd88b8b..9e1ef1b 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala @@ -89,11 +89,7 @@ private[yarn] class YarnAllocator( @volatile private var numExecutorsFailed = 0 @volatile private var targetNumExecutors = -if (Utils.isDynamicAllocationEnabled(sparkConf)) { - sparkConf.getInt("spark.dynamicAllocation.initialExecutors", 0) -} else { - sparkConf.getInt("spark.executor.instances", YarnSparkHadoopUtil.DEFAULT_NUMBER_EXECUTORS) -} +YarnSparkHadoopUtil.getInitialTargetExecutorNumber(sparkConf) // Executor loss reason requests that are pending - maps from executor ID for inquiry to a // list of requesters that should be responded to once we find out why the given executor http://git-wip-us.apache.org/repos/asf/spark/blob/353c30bd/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala -- diff --git
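To make the consolidated semantics concrete, here is a hedged Scala sketch of the logic both call sites now share: with dynamic allocation enabled, the initial target defaults to `spark.dynamicAllocation.minExecutors` (not 0) when `spark.dynamicAllocation.initialExecutors` is unset, and otherwise `spark.executor.instances` applies. It mirrors the behavior shown in the diff but is not necessarily the exact body of `YarnSparkHadoopUtil.getInitialTargetExecutorNumber`.

```scala
import org.apache.spark.SparkConf

object InitialExecutorsSketch {
  val DefaultNumberExecutors = 2  // assumed default, standing in for YarnSparkHadoopUtil.DEFAULT_NUMBER_EXECUTORS

  def initialTargetExecutors(conf: SparkConf, numExecutors: Int = DefaultNumberExecutors): Int =
    if (conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
      val minNumExecutors = conf.getInt("spark.dynamicAllocation.minExecutors", 0)
      // Key point of the fix: default to minExecutors, not 0.
      val initialNumExecutors =
        conf.getInt("spark.dynamicAllocation.initialExecutors", minNumExecutors)
      val maxNumExecutors = conf.getInt("spark.dynamicAllocation.maxExecutors", Int.MaxValue)
      require(initialNumExecutors >= minNumExecutors && initialNumExecutors <= maxNumExecutors,
        s"initialExecutors ($initialNumExecutors) must be between minExecutors and maxExecutors")
      initialNumExecutors
    } else {
      conf.getInt("spark.executor.instances", numExecutors)
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(false)
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "3")
    println(initialTargetExecutors(conf)) // 3, where the old code would have started the target at 0
  }
}
```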
spark git commit: [SPARK-10812] [YARN] Spark hadoop util support switching to yarn
Repository: spark Updated Branches: refs/heads/master b58249930 -> d8d50ed38 [SPARK-10812] [YARN] Spark hadoop util support switching to yarn While this is likely not a huge issue for real production systems, for test systems which may set up a Spark Context, tear it down, and stand up a Spark Context with a different master (e.g. some local mode & some yarn mode), this can be an issue. Discovered during work on spark-testing-base on Spark 1.4.1, but it seems like the logic that triggers it is present in master (see the SparkHadoopUtil object). A valid workaround for users encountering this issue is to fork a different JVM, however this can be heavyweight. ``` [info] SampleMiniClusterTest: [info] Exception encountered when attempting to run a suite with class name: com.holdenkarau.spark.testing.SampleMiniClusterTest *** ABORTED *** [info] java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil [info] at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:163) [info] at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:257) [info] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561) [info] at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115) [info] at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57) [info] at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141) [info] at org.apache.spark.SparkContext.(SparkContext.scala:497) [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.setup(SharedMiniCluster.scala:186) [info] at com.holdenkarau.spark.testing.SampleMiniClusterTest.setup(SampleMiniClusterTest.scala:26) [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.beforeAll(SharedMiniCluster.scala:103) ``` Author: Holden Karau Closes #8911 from holdenk/SPARK-10812-spark-hadoop-util-support-switching-to-yarn. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d8d50ed3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d8d50ed3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d8d50ed3 Branch: refs/heads/master Commit: d8d50ed388d2e695b69d2b93a620045ef2f0bc18 Parents: b582499 Author: Holden Karau Authored: Mon Sep 28 06:33:45 2015 -0700 Committer: Marcelo Vanzin Committed: Mon Sep 28 06:33:45 2015 -0700 -- .../scala/org/apache/spark/SparkContext.scala | 2 ++ .../apache/spark/deploy/SparkHadoopUtil.scala | 30 ++-- .../org/apache/spark/deploy/yarn/Client.scala | 6 +++- .../deploy/yarn/YarnSparkHadoopUtilSuite.scala | 12 4 files changed, 34 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d8d50ed3/core/src/main/scala/org/apache/spark/SparkContext.scala -- diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index bf3aeb4..0c72adf 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -1756,6 +1756,8 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli } SparkEnv.set(null) } +// Unset YARN mode system env variable, to allow switching between cluster types.
+System.clearProperty("SPARK_YARN_MODE") SparkContext.clearActiveContext() logInfo("Successfully stopped SparkContext") } http://git-wip-us.apache.org/repos/asf/spark/blob/d8d50ed3/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala index a0b7365..d606b80 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala @@ -385,20 +385,13 @@ class SparkHadoopUtil extends Logging { object SparkHadoopUtil { - private val hadoop = { -val yarnMode = java.lang.Boolean.valueOf( -System.getProperty("SPARK_YARN_MODE", System.getenv("SPARK_YARN_MODE"))) -if (yarnMode) { - try { -Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil") - .newInstance() - .asInstanceOf[SparkHadoopUtil] - } catch { - case e: Exception => throw new SparkException("Unable to load YARN support", e) - } -} else { - new SparkHadoopUtil -} + private lazy val hadoop = new SparkHadoopUtil + priva
spark git commit: Fix two mistakes in programming-guide page
Repository: spark Updated Branches: refs/heads/master fb4c7be74 -> b58249930 Fix two mistakes in programming-guide page seperate -> separate sees -> see Author: David Martin Closes #8928 from dmartinpro/patch-1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b5824993 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b5824993 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b5824993 Branch: refs/heads/master Commit: b58249930d58e2de238c05aaf5fa9315b4c3cbab Parents: fb4c7be Author: David Martin Authored: Mon Sep 28 10:41:39 2015 +0100 Committer: Sean Owen Committed: Mon Sep 28 10:41:39 2015 +0100 -- docs/programming-guide.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b5824993/docs/programming-guide.md -- diff --git a/docs/programming-guide.md b/docs/programming-guide.md index 8ad2383..22656fd 100644 --- a/docs/programming-guide.md +++ b/docs/programming-guide.md @@ -805,9 +805,9 @@ print("Counter value: " + counter) The primary challenge is that the behavior of the above code is undefined. In local mode with a single JVM, the above code will sum the values within the RDD and store it in **counter**. This is because both the RDD and the variable **counter** are in the same memory space on the driver node. -However, in `cluster` mode, what happens is more complicated, and the above may not work as intended. To execute jobs, Spark breaks up the processing of RDD operations into tasks - each of which is operated on by an executor. Prior to execution, Spark computes the **closure**. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case `foreach()`). This closure is serialized and sent to each executor. In `local` mode, there is only the one executors so everything shares the same closure. In other modes however, this is not the case and the executors running on seperate worker nodes each have their own copy of the closure. +However, in `cluster` mode, what happens is more complicated, and the above may not work as intended. To execute jobs, Spark breaks up the processing of RDD operations into tasks - each of which is operated on by an executor. Prior to execution, Spark computes the **closure**. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case `foreach()`). This closure is serialized and sent to each executor. In `local` mode, there is only the one executors so everything shares the same closure. In other modes however, this is not the case and the executors running on separate worker nodes each have their own copy of the closure. -What is happening here is that the variables within the closure sent to each executor are now copies and thus, when **counter** is referenced within the `foreach` function, it's no longer the **counter** on the driver node. There is still a **counter** in the memory of the driver node but this is no longer visible to the executors! The executors only sees the copy from the serialized closure. Thus, the final value of **counter** will still be zero since all operations on **counter** were referencing the value within the serialized closure. 
+What is happening here is that the variables within the closure sent to each executor are now copies and thus, when **counter** is referenced within the `foreach` function, it's no longer the **counter** on the driver node. There is still a **counter** in the memory of the driver node but this is no longer visible to the executors! The executors only see the copy from the serialized closure. Thus, the final value of **counter** will still be zero since all operations on **counter** were referencing the value within the serialized closure. To ensure well-defined behavior in these sorts of scenarios one should use an [`Accumulator`](#AccumLink). Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster. The Accumulators section of this guide discusses these in more detail.
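Here is a small, self-contained example of the accumulator-based approach that paragraph recommends, using the Spark 1.x accumulator API and a local master purely for illustration; the plain-variable version is included only for contrast, and its result is mode-dependent and undefined, exactly as the guide warns.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CounterWithAccumulator {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CounterWithAccumulator").setMaster("local[2]"))
    val data = sc.parallelize(1 to 100)

    // Undefined: each task updates a deserialized copy of `counter` from its
    // closure, so the driver-side value may not reflect the sum at all.
    var counter = 0
    data.foreach(x => counter += x)
    println("plain variable: " + counter)

    // Well-defined: accumulators exist precisely for updates made on executors.
    val sum = sc.accumulator(0, "sum")
    data.foreach(x => sum += x)
    println("accumulator: " + sum.value) // 5050

    sc.stop()
  }
}
```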