spark git commit: [SPARK-10670] [ML] [Doc] add api reference for ml doc
Repository: spark Updated Branches: refs/heads/master bf4199e26 -> 9b9fe5f7b [SPARK-10670] [ML] [Doc] add api reference for ml doc jira: https://issues.apache.org/jira/browse/SPARK-10670 In the Markdown docs for the spark.ml Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "Word2Vec" section in https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/ml-features.md This JIRA is just for spark.ml, not spark.mllib Author: Yuhao Yang Closes #8901 from hhbyyh/docAPI. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9b9fe5f7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9b9fe5f7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9b9fe5f7 Branch: refs/heads/master Commit: 9b9fe5f7bf55257269d8febcd64e95677075dfb6 Parents: bf4199e Author: Yuhao Yang Authored: Mon Sep 28 22:40:02 2015 -0700 Committer: Xiangrui Meng Committed: Mon Sep 28 22:40:02 2015 -0700 -- docs/ml-features.md | 259 +++ 1 file changed, 195 insertions(+), 64 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9b9fe5f7/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index b70da4a..44a9882 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -28,12 +28,15 @@ The algorithm combines Term Frequency (TF) counts with the [hashing trick](http: **IDF**: `IDF` is an `Estimator` which fits on a dataset and produces an `IDFModel`. The `IDFModel` takes feature vectors (generally created from `HashingTF`) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus. Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for more details on Term Frequency and Inverse Document Frequency. -For API details, refer to the [HashingTF API docs](api/scala/index.html#org.apache.spark.ml.feature.HashingTF) and the [IDF API docs](api/scala/index.html#org.apache.spark.ml.feature.IDF). In the following code segment, we start with a set of sentences. We split each sentence into words using `Tokenizer`. For each sentence (bag of words), we use `HashingTF` to hash the sentence into a feature vector. We use `IDF` to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm. + +Refer to the [HashingTF Scala docs](api/scala/index.html#org.apache.spark.ml.feature.HashingTF) and +the [IDF Scala docs](api/scala/index.html#org.apache.spark.ml.feature.IDF) for more details on the API. + {% highlight scala %} import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer} @@ -54,6 +57,10 @@ rescaledData.select("features", "label").take(3).foreach(println) + +Refer to the [HashingTF Java docs](api/java/org/apache/spark/ml/feature/HashingTF.html) and the +[IDF Java docs](api/java/org/apache/spark/ml/feature/IDF.html) for more details on the API. + {% highlight java %} import java.util.Arrays; @@ -100,6 +107,10 @@ for (Row r : rescaledData.select("features", "label").take(3)) { + +Refer to the [HashingTF Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.HashingTF) and +the [IDF Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.IDF) for more details on the API. 
+ {% highlight python %} from pyspark.ml.feature import HashingTF, IDF, Tokenizer @@ -267,9 +278,11 @@ each vector represents the token counts of the document over the vocabulary. -More details can be found in the API docs for -[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and -[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel). + +Refer to the [CountVectorizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) +and the [CountVectorizerModel Scala docs](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel) +for more details on the API. + {% highlight scala %} import org.apache.spark.ml.feature.CountVectorizer import org.apache.spark.mllib.util.CountVectorizerModel @@ -297,9 +310,11 @@ cvModel.transform(df).select("features").show() -More details can be found in the API docs for -[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) and -[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html). + +Refer to the [CountVectorizer Java docs](api/java/org/apache/spark/ml/feature/CountVectorize
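For readers without the full diff handy, here is a minimal, self-contained Scala sketch of the TF-IDF pipeline those codetabs document (Tokenizer -> HashingTF -> IDF), assuming a local master and the Spark 1.5-era spark.ml API; the toy sentences, app name, and `local[2]` master are illustrative choices, not part of the commit.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SQLContext

object TfIdfSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TfIdfSketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // A toy corpus of (label, sentence) pairs.
    val sentenceData = sqlContext.createDataFrame(Seq(
      (0, "Hi I heard about Spark"),
      (0, "I wish Java could use case classes"),
      (1, "Logistic regression models are neat")
    )).toDF("label", "sentence")

    // Split each sentence into words.
    val wordsData = new Tokenizer()
      .setInputCol("sentence").setOutputCol("words")
      .transform(sentenceData)

    // Hash each bag of words into a fixed-size term-frequency vector.
    val featurized = new HashingTF()
      .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
      .transform(wordsData)

    // Rescale term frequencies by inverse document frequency.
    val rescaled = new IDF()
      .setInputCol("rawFeatures").setOutputCol("features")
      .fit(featurized)
      .transform(featurized)

    rescaled.select("features", "label").take(3).foreach(println)
    sc.stop()
  }
}
```

The Java and Python tabs walk through the same three steps, which is why the commit adds a language-specific API link above each one.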
[2/2] spark git commit: [SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE
[SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to light that the guidance at http://www.apache.org/dev/licensing-howto.html#permissive-deps means that permissively-licensed dependencies has a different interpretation than we (er, I) had been operating under. "pointer ... to the license within the source tree" specifically means a copy of the license within Spark's distribution, whereas at the moment, Spark's LICENSE has a pointer to the project's license in the other project's source tree. The remedy is simply to inline all such license references (i.e. BSD/MIT licenses) or include their text in "licenses" subdirectory and point to that. Along the way, we can also treat other BSD/MIT licenses, whose text has been inlined into LICENSE, in the same way. The LICENSE file can continue to provide a helpful list of BSD/MIT licensed projects and a pointer to their sites. This would be over and above including license text in the distro, which is the essential thing. Author: Sean Owen Closes #8919 from srowen/SPARK-10833. (cherry picked from commit bf4199e261c3c8dd2970e2a154c97b46fb339f02) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9b3014bc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9b3014bc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9b3014bc Branch: refs/heads/branch-1.5 Commit: 9b3014bc4e0dd2bcfbd7e42af83753393b74a760 Parents: a367840 Author: Sean Owen Authored: Mon Sep 28 22:56:43 2015 -0400 Committer: Sean Owen Committed: Mon Sep 28 22:56:59 2015 -0400 -- LICENSE | 699 +-- NOTICE | 35 + .../apache/spark/util/collection/TimSort.java | 18 + licenses/LICENSE-AnchorJS.txt | 21 + licenses/LICENSE-DPark.txt | 30 + licenses/LICENSE-Mockito.txt| 21 + licenses/LICENSE-SnapTree.txt | 35 + licenses/LICENSE-antlr.txt | 8 + licenses/LICENSE-boto.txt | 20 + licenses/LICENSE-cloudpickle.txt| 28 + licenses/LICENSE-d3.min.js.txt | 26 + licenses/LICENSE-dagre-d3.txt | 19 + licenses/LICENSE-f2j.txt| 8 + licenses/LICENSE-graphlib-dot.txt | 19 + licenses/LICENSE-heapq.txt | 280 licenses/LICENSE-javolution.txt | 27 + licenses/LICENSE-jbcrypt.txt| 17 + licenses/LICENSE-jblas.txt | 31 + licenses/LICENSE-jline.txt | 32 + licenses/LICENSE-jpmml-model.txt| 10 + licenses/LICENSE-jquery.txt | 9 + licenses/LICENSE-junit-interface.txt| 24 + licenses/LICENSE-kryo.txt | 10 + licenses/LICENSE-minlog.txt | 10 + licenses/LICENSE-netlib.txt | 49 ++ licenses/LICENSE-paranamer.txt | 28 + licenses/LICENSE-protobuf.txt | 42 ++ licenses/LICENSE-py4j.txt | 27 + licenses/LICENSE-pyrolite.txt | 28 + licenses/LICENSE-reflectasm.txt | 10 + licenses/LICENSE-sbt-launch-lib.txt | 26 + licenses/LICENSE-scala.txt | 30 + licenses/LICENSE-scalacheck.txt | 32 + licenses/LICENSE-scopt.txt | 21 + licenses/LICENSE-slf4j.txt | 21 + licenses/LICENSE-sorttable.js.txt | 16 + licenses/LICENSE-spire.txt | 19 + licenses/LICENSE-xmlenc.txt | 27 + make-distribution.sh| 1 + .../spark/network/util/LimitedInputStream.java | 18 + 40 files changed, 1153 insertions(+), 679 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9b3014bc/LICENSE -- diff --git a/LICENSE b/LICENSE index f9e412c..dca03ab 100644 --- a/LICENSE +++ b/LICENSE @@ -211,712 +211,45 @@ subcomponents is subject to the terms and conditions of the following licenses. 
-=== -For the Boto EC2 library (ec2/third_party/boto*.zip): -=== - -Copyright (c) 2006-2008 Mitch Garnaat http://garnaat.org/ - -Permission is hereby granted, free of charge, to any person obtaining a -copy of this software and associated documentation files (the -"Software"), to deal in the Software
[1/2] spark git commit: [SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE
Repository: spark Updated Branches: refs/heads/branch-1.5 a36784083 -> 9b3014bc4 http://git-wip-us.apache.org/repos/asf/spark/blob/9b3014bc/licenses/LICENSE-jpmml-model.txt -- diff --git a/licenses/LICENSE-jpmml-model.txt b/licenses/LICENSE-jpmml-model.txt new file mode 100644 index 000..69411d1 --- /dev/null +++ b/licenses/LICENSE-jpmml-model.txt @@ -0,0 +1,10 @@ +Copyright (c) 2009, University of Tartu +All rights reserved. + +Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: + +1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. +2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. +3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/spark/blob/9b3014bc/licenses/LICENSE-jquery.txt -- diff --git a/licenses/LICENSE-jquery.txt b/licenses/LICENSE-jquery.txt new file mode 100644 index 000..e1dd696 --- /dev/null +++ b/licenses/LICENSE-jquery.txt @@ -0,0 +1,9 @@ +The MIT License (MIT) + +Copyright (c) + +Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
\ No newline at end of file http://git-wip-us.apache.org/repos/asf/spark/blob/9b3014bc/licenses/LICENSE-junit-interface.txt -- diff --git a/licenses/LICENSE-junit-interface.txt b/licenses/LICENSE-junit-interface.txt new file mode 100644 index 000..e835350 --- /dev/null +++ b/licenses/LICENSE-junit-interface.txt @@ -0,0 +1,24 @@ +Copyright (c) 2009-2012, Stefan Zeiger +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + +* Redistributions of source code must retain the above copyright notice, + this list of conditions and the following disclaimer. + +* Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIAB
[1/2] spark git commit: [SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE
Repository: spark Updated Branches: refs/heads/master ea02e5513 -> bf4199e26 http://git-wip-us.apache.org/repos/asf/spark/blob/bf4199e2/licenses/LICENSE-jquery.txt -- diff --git a/licenses/LICENSE-jquery.txt b/licenses/LICENSE-jquery.txt new file mode 100644 index 000..e1dd696 --- /dev/null +++ b/licenses/LICENSE-jquery.txt @@ -0,0 +1,9 @@ +The MIT License (MIT) + +Copyright (c) + +Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/spark/blob/bf4199e2/licenses/LICENSE-junit-interface.txt -- diff --git a/licenses/LICENSE-junit-interface.txt b/licenses/LICENSE-junit-interface.txt new file mode 100644 index 000..e835350 --- /dev/null +++ b/licenses/LICENSE-junit-interface.txt @@ -0,0 +1,24 @@ +Copyright (c) 2009-2012, Stefan Zeiger +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + +* Redistributions of source code must retain the above copyright notice, + this list of conditions and the following disclaimer. + +* Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/spark/blob/bf4199e2/licenses/LICENSE-kryo.txt -- diff --git a/licenses/LICENSE-kryo.txt b/licenses/LICENSE-kryo.txt new file mode 100644 index 000..3f6a160 --- /dev/null +++ b/licenses/LICENSE-kryo.txt @@ -0,0 +1,10 @@ +Copyright (c) 2008, Nathan Sweet +All rights reserved. 
+ +Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: + +* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. +* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. +* Neither the name of Esoteric Software nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIG
[2/2] spark git commit: [SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE
[SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to light that the guidance at http://www.apache.org/dev/licensing-howto.html#permissive-deps means that permissively-licensed dependencies has a different interpretation than we (er, I) had been operating under. "pointer ... to the license within the source tree" specifically means a copy of the license within Spark's distribution, whereas at the moment, Spark's LICENSE has a pointer to the project's license in the other project's source tree. The remedy is simply to inline all such license references (i.e. BSD/MIT licenses) or include their text in "licenses" subdirectory and point to that. Along the way, we can also treat other BSD/MIT licenses, whose text has been inlined into LICENSE, in the same way. The LICENSE file can continue to provide a helpful list of BSD/MIT licensed projects and a pointer to their sites. This would be over and above including license text in the distro, which is the essential thing. Author: Sean Owen Closes #8919 from srowen/SPARK-10833. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bf4199e2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bf4199e2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bf4199e2 Branch: refs/heads/master Commit: bf4199e261c3c8dd2970e2a154c97b46fb339f02 Parents: ea02e55 Author: Sean Owen Authored: Mon Sep 28 22:56:43 2015 -0400 Committer: Sean Owen Committed: Mon Sep 28 22:56:43 2015 -0400 -- LICENSE | 699 +-- NOTICE | 35 + .../apache/spark/util/collection/TimSort.java | 18 + licenses/LICENSE-AnchorJS.txt | 21 + licenses/LICENSE-DPark.txt | 30 + licenses/LICENSE-Mockito.txt| 21 + licenses/LICENSE-SnapTree.txt | 35 + licenses/LICENSE-antlr.txt | 8 + licenses/LICENSE-boto.txt | 20 + licenses/LICENSE-cloudpickle.txt| 28 + licenses/LICENSE-d3.min.js.txt | 26 + licenses/LICENSE-dagre-d3.txt | 19 + licenses/LICENSE-f2j.txt| 8 + licenses/LICENSE-graphlib-dot.txt | 19 + licenses/LICENSE-heapq.txt | 280 licenses/LICENSE-javolution.txt | 27 + licenses/LICENSE-jbcrypt.txt| 17 + licenses/LICENSE-jblas.txt | 31 + licenses/LICENSE-jline.txt | 32 + licenses/LICENSE-jpmml-model.txt| 10 + licenses/LICENSE-jquery.txt | 9 + licenses/LICENSE-junit-interface.txt| 24 + licenses/LICENSE-kryo.txt | 10 + licenses/LICENSE-minlog.txt | 10 + licenses/LICENSE-netlib.txt | 49 ++ licenses/LICENSE-paranamer.txt | 28 + licenses/LICENSE-protobuf.txt | 42 ++ licenses/LICENSE-py4j.txt | 27 + licenses/LICENSE-pyrolite.txt | 28 + licenses/LICENSE-reflectasm.txt | 10 + licenses/LICENSE-sbt-launch-lib.txt | 26 + licenses/LICENSE-scala.txt | 30 + licenses/LICENSE-scalacheck.txt | 32 + licenses/LICENSE-scopt.txt | 21 + licenses/LICENSE-slf4j.txt | 21 + licenses/LICENSE-sorttable.js.txt | 16 + licenses/LICENSE-spire.txt | 19 + licenses/LICENSE-xmlenc.txt | 27 + make-distribution.sh| 1 + .../spark/network/util/LimitedInputStream.java | 18 + 40 files changed, 1153 insertions(+), 679 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bf4199e2/LICENSE -- diff --git a/LICENSE b/LICENSE index f9e412c..dca03ab 100644 --- a/LICENSE +++ b/LICENSE @@ -211,712 +211,45 @@ subcomponents is subject to the terms and conditions of the following licenses. 
-=== -For the Boto EC2 library (ec2/third_party/boto*.zip): -=== - -Copyright (c) 2006-2008 Mitch Garnaat http://garnaat.org/ - -Permission is hereby granted, free of charge, to any person obtaining a -copy of this software and associated documentation files (the -"Software"), to deal in the Software without restriction, including -without limitation the rights to use, copy, modify, merge, publish,
spark git commit: [SPARK-10859] [SQL] fix stats of StringType in columnar cache
Repository: spark Updated Branches: refs/heads/branch-1.5 de259316b -> a36784083 [SPARK-10859] [SQL] fix stats of StringType in columnar cache The UTF8String may come from an UnsafeRow, in which case its underlying buffer is not copied, so we should clone it in order to hold it in the stats. cc yhuai Author: Davies Liu Closes #8929 from davies/pushdown_string. (cherry picked from commit ea02e5513a8f9853094d5612c962fc8c1a340f50) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a3678408 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a3678408 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a3678408 Branch: refs/heads/branch-1.5 Commit: a367840834b97cd6a9ecda568bb21ee6dc35fcde Parents: de25931 Author: Davies Liu Authored: Mon Sep 28 14:40:40 2015 -0700 Committer: Yin Huai Committed: Mon Sep 28 14:40:52 2015 -0700 -- .../scala/org/apache/spark/sql/columnar/ColumnStats.scala | 4 ++-- .../spark/sql/columnar/InMemoryColumnarQuerySuite.scala | 7 +++ 2 files changed, 9 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a3678408/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala b/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala index 5cbd52b..fbd51b7 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala @@ -213,8 +213,8 @@ private[sql] class StringColumnStats extends ColumnStats { super.gatherStats(row, ordinal) if (!row.isNullAt(ordinal)) { val value = row.getUTF8String(ordinal) - if (upper == null || value.compareTo(upper) > 0) upper = value - if (lower == null || value.compareTo(lower) < 0) lower = value + if (upper == null || value.compareTo(upper) > 0) upper = value.clone() + if (lower == null || value.compareTo(lower) < 0) lower = value.clone() sizeInBytes += STRING.actualSize(row, ordinal) } } http://git-wip-us.apache.org/repos/asf/spark/blob/a3678408/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala index 83db9b6..3a0f346 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala @@ -211,4 +211,11 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { // Drop the cache. cached.unpersist() } + + test("SPARK-10859: Predicates pushed to InMemoryColumnarTableScan are not evaluated correctly") { +val data = sqlContext.range(10).selectExpr("id", "cast(id as string) as s") +data.cache() +assert(data.count() === 10) +assert(data.filter($"s" === "3").count() === 1) + } }
spark git commit: [SPARK-10859] [SQL] fix stats of StringType in columnar cache
Repository: spark Updated Branches: refs/heads/master 14978b785 -> ea02e5513 [SPARK-10859] [SQL] fix stats of StringType in columnar cache The UTF8String may come from an UnsafeRow, in which case its underlying buffer is not copied, so we should clone it in order to hold it in the stats. cc yhuai Author: Davies Liu Closes #8929 from davies/pushdown_string. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ea02e551 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ea02e551 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ea02e551 Branch: refs/heads/master Commit: ea02e5513a8f9853094d5612c962fc8c1a340f50 Parents: 14978b7 Author: Davies Liu Authored: Mon Sep 28 14:40:40 2015 -0700 Committer: Yin Huai Committed: Mon Sep 28 14:40:40 2015 -0700 -- .../scala/org/apache/spark/sql/columnar/ColumnStats.scala | 4 ++-- .../spark/sql/columnar/InMemoryColumnarQuerySuite.scala | 7 +++ 2 files changed, 9 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ea02e551/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala b/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala index 5cbd52b..fbd51b7 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala @@ -213,8 +213,8 @@ private[sql] class StringColumnStats extends ColumnStats { super.gatherStats(row, ordinal) if (!row.isNullAt(ordinal)) { val value = row.getUTF8String(ordinal) - if (upper == null || value.compareTo(upper) > 0) upper = value - if (lower == null || value.compareTo(lower) < 0) lower = value + if (upper == null || value.compareTo(upper) > 0) upper = value.clone() + if (lower == null || value.compareTo(lower) < 0) lower = value.clone() sizeInBytes += STRING.actualSize(row, ordinal) } } http://git-wip-us.apache.org/repos/asf/spark/blob/ea02e551/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala index cd3644e..ea5dd2b 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/columnar/InMemoryColumnarQuerySuite.scala @@ -212,4 +212,11 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { // Drop the cache. cached.unpersist() } + + test("SPARK-10859: Predicates pushed to InMemoryColumnarTableScan are not evaluated correctly") { +val data = sqlContext.range(10).selectExpr("id", "cast(id as string) as s") +data.cache() +assert(data.count() === 10) +assert(data.filter($"s" === "3").count() === 1) + } }
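The root cause is easier to see outside Spark's internals: a `UTF8String` obtained from an `UnsafeRow` can point into a buffer that the scan reuses for the next row, so keeping it around as a running min/max lets its contents change underneath the stats. The following contrived, self-contained Scala sketch (a stand-in, not the actual `UTF8String`/`UnsafeRow` classes) reproduces that aliasing pitfall and the clone-before-store remedy the patch applies.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Contrived stand-in for a UTF8String that aliases a reusable row buffer.
final class BufferBackedString(buf: Array[Byte], offset: Int, len: Int) {
  override def toString: String = new String(buf, offset, len, UTF_8)
  def compareTo(other: BufferBackedString): Int = toString.compareTo(other.toString)
  // Defensive copy that no longer aliases the shared buffer.
  def copy(): BufferBackedString = {
    val bytes = java.util.Arrays.copyOfRange(buf, offset, offset + len)
    new BufferBackedString(bytes, 0, bytes.length)
  }
}

object StatsAliasingDemo {
  def main(args: Array[String]): Unit = {
    val rowBuffer = new Array[Byte](3)    // reused for every "row"
    var upper: BufferBackedString = null  // running max, like StringColumnStats.upper

    for (s <- Seq("abc", "zzz", "def")) {
      s.getBytes(UTF_8).copyToArray(rowBuffer)            // the next row overwrites the buffer
      val value = new BufferBackedString(rowBuffer, 0, 3)
      // Without .copy(), `upper` would keep pointing at rowBuffer and end up reading "def".
      if (upper == null || value.compareTo(upper) > 0) upper = value.copy()
    }
    println(s"max string seen: $upper")  // prints "zzz"
  }
}
```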
spark git commit: [SPARK-10395] [SQL] Simplifies CatalystReadSupport
Repository: spark Updated Branches: refs/heads/master 353c30bd7 -> 14978b785 [SPARK-10395] [SQL] Simplifies CatalystReadSupport Please refer to [SPARK-10395] [1] for details. [1]: https://issues.apache.org/jira/browse/SPARK-10395 Author: Cheng Lian Closes #8553 from liancheng/spark-10395/simplify-parquet-read-support. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/14978b78 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/14978b78 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/14978b78 Branch: refs/heads/master Commit: 14978b785a43e0c13c8bdfd52d20cc8984984ba3 Parents: 353c30b Author: Cheng Lian Authored: Mon Sep 28 13:53:45 2015 -0700 Committer: Davies Liu Committed: Mon Sep 28 13:53:45 2015 -0700 -- .../parquet/CatalystReadSupport.scala | 92 ++-- 1 file changed, 45 insertions(+), 47 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/14978b78/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala index 8c819f1..9502b83 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala @@ -19,7 +19,7 @@ package org.apache.spark.sql.execution.datasources.parquet import java.util.{Map => JMap} -import scala.collection.JavaConverters.{collectionAsScalaIterableConverter, mapAsJavaMapConverter, mapAsScalaMapConverter} +import scala.collection.JavaConverters._ import org.apache.hadoop.conf.Configuration import org.apache.parquet.hadoop.api.ReadSupport.ReadContext @@ -29,34 +29,62 @@ import org.apache.parquet.schema.Type.Repetition import org.apache.parquet.schema._ import org.apache.spark.Logging +import org.apache.spark.deploy.SparkHadoopUtil import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.types._ +/** + * A Parquet [[ReadSupport]] implementation for reading Parquet records as Catalyst + * [[InternalRow]]s. + * + * The API interface of [[ReadSupport]] is a little bit over complicated because of historical + * reasons. In older versions of parquet-mr (say 1.6.0rc3 and prior), [[ReadSupport]] need to be + * instantiated and initialized twice on both driver side and executor side. The [[init()]] method + * is for driver side initialization, while [[prepareForRead()]] is for executor side. However, + * starting from parquet-mr 1.6.0, it's no longer the case, and [[ReadSupport]] is only instantiated + * and initialized on executor side. So, theoretically, now it's totally fine to combine these two + * methods into a single initialization method. The only reason (I could think of) to still have + * them here is for parquet-mr API backwards-compatibility. + * + * Due to this reason, we no longer rely on [[ReadContext]] to pass requested schema from [[init()]] + * to [[prepareForRead()]], but use a private `var` for simplicity. + */ private[parquet] class CatalystReadSupport extends ReadSupport[InternalRow] with Logging { - // Called after `init()` when initializing Parquet record reader. + private var catalystRequestedSchema: StructType = _ + + /** + * Called on executor side before [[prepareForRead()]] and instantiating actual Parquet record + * readers. 
Responsible for figuring out Parquet requested schema used for column pruning. + */ + override def init(context: InitContext): ReadContext = { +catalystRequestedSchema = { + // scalastyle:off jobcontext + val conf = context.getConfiguration + // scalastyle:on jobcontext + val schemaString = conf.get(CatalystReadSupport.SPARK_ROW_REQUESTED_SCHEMA) + assert(schemaString != null, "Parquet requested schema not set.") + StructType.fromString(schemaString) +} + +val parquetRequestedSchema = + CatalystReadSupport.clipParquetSchema(context.getFileSchema, catalystRequestedSchema) + +new ReadContext(parquetRequestedSchema, Map.empty[String, String].asJava) + } + + /** + * Called on executor side after [[init()]], before instantiating actual Parquet record readers. + * Responsible for instantiating [[RecordMaterializer]], which is used for converting Parquet + * records to Catalyst [[InternalRow]]s. + */ override def prepareForRead( conf: Configuration, keyValueMetaData: JMap[String, String], fileSchema: MessageType, readContext: ReadContext): Recor
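One practical detail behind the new scaladoc: the pruned Catalyst schema reaches the read path as a JSON string stored in the Hadoop configuration (the real constant lives in `CatalystReadSupport.SPARK_ROW_REQUESTED_SCHEMA`). The sketch below only illustrates that round trip under a made-up key name; it is not the commit's code.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.types._

object RequestedSchemaRoundTrip {
  // Hypothetical key name, for illustration only.
  val RequestedSchemaKey = "example.parquet.row.requested_schema"

  def main(args: Array[String]): Unit = {
    val requested = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("s", StringType)))

    // "Driver side": serialize the pruned schema into the job configuration.
    val conf = new Configuration()
    conf.set(RequestedSchemaKey, requested.json)

    // "Executor side": parse it back before building the record materializer.
    val restored = DataType.fromJson(conf.get(RequestedSchemaKey)).asInstanceOf[StructType]
    assert(restored == requested)
    println(restored)
  }
}
```

This is also why the `assert(schemaString != null, ...)` in `init()` matters: if the driver never set the key, failing fast beats silently reading the wrong columns.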
spark git commit: [SPARK-10790] [YARN] Fix initial executor number not set issue and consolidate the codes
Repository: spark Updated Branches: refs/heads/branch-1.5 e0c3212a9 -> de259316b [SPARK-10790] [YARN] Fix initial executor number not set issue and consolidate the codes This bug is introduced in [SPARK-9092](https://issues.apache.org/jira/browse/SPARK-9092), `targetExecutorNumber` should use `minExecutors` if `initialExecutors` is not set. Using 0 instead will meet the problem as mentioned in [SPARK-10790](https://issues.apache.org/jira/browse/SPARK-10790). Also consolidate and simplify some similar code snippets to keep the consistent semantics. Author: jerryshao Closes #8910 from jerryshao/SPARK-10790. (cherry picked from commit 353c30bd7dfbd3b76fc8bc9a6dfab9321439a34b) Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/de259316 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/de259316 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/de259316 Branch: refs/heads/branch-1.5 Commit: de259316b491762dbcffd1667b669f909125dd13 Parents: e0c3212 Author: jerryshao Authored: Mon Sep 28 06:38:54 2015 -0700 Committer: Marcelo Vanzin Committed: Mon Sep 28 06:39:13 2015 -0700 -- .../spark/deploy/yarn/ClientArguments.scala | 20 + .../spark/deploy/yarn/YarnAllocator.scala | 6 + .../spark/deploy/yarn/YarnSparkHadoopUtil.scala | 23 .../cluster/YarnClusterSchedulerBackend.scala | 18 ++- 4 files changed, 27 insertions(+), 40 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/de259316/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala index 54f62e6..1165061 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala @@ -81,25 +81,7 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf) .orNull // If dynamic allocation is enabled, start at the configured initial number of executors. // Default to minExecutors if no initialExecutors is set. 
-if (isDynamicAllocationEnabled) { - val minExecutorsConf = "spark.dynamicAllocation.minExecutors" - val initialExecutorsConf = "spark.dynamicAllocation.initialExecutors" - val maxExecutorsConf = "spark.dynamicAllocation.maxExecutors" - val minNumExecutors = sparkConf.getInt(minExecutorsConf, 0) - val initialNumExecutors = sparkConf.getInt(initialExecutorsConf, minNumExecutors) - val maxNumExecutors = sparkConf.getInt(maxExecutorsConf, Integer.MAX_VALUE) - - // If defined, initial executors must be between min and max - if (initialNumExecutors < minNumExecutors || initialNumExecutors > maxNumExecutors) { -throw new IllegalArgumentException( - s"$initialExecutorsConf must be between $minExecutorsConf and $maxNumExecutors!") - } - - numExecutors = initialNumExecutors -} else { - val numExecutorsConf = "spark.executor.instances" - numExecutors = sparkConf.getInt(numExecutorsConf, numExecutors) -} +numExecutors = YarnSparkHadoopUtil.getInitialTargetExecutorNumber(sparkConf) principal = Option(principal) .orElse(sparkConf.getOption("spark.yarn.principal")) .orNull http://git-wip-us.apache.org/repos/asf/spark/blob/de259316/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala index ccf753e..6a02848 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala @@ -89,11 +89,7 @@ private[yarn] class YarnAllocator( @volatile private var numExecutorsFailed = 0 @volatile private var targetNumExecutors = -if (Utils.isDynamicAllocationEnabled(sparkConf)) { - sparkConf.getInt("spark.dynamicAllocation.initialExecutors", 0) -} else { - sparkConf.getInt("spark.executor.instances", YarnSparkHadoopUtil.DEFAULT_NUMBER_EXECUTORS) -} +YarnSparkHadoopUtil.getInitialTargetExecutorNumber(sparkConf) // Keep track of which container is running which executor to remove the executors later // Visible for testing. http://git-wip-us.apache.org/repos/asf/spark/blob/de259316/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala ---
spark git commit: [SPARK-10790] [YARN] Fix initial executor number not set issue and consolidate the codes
Repository: spark Updated Branches: refs/heads/master d8d50ed38 -> 353c30bd7 [SPARK-10790] [YARN] Fix initial executor number not set issue and consolidate the codes This bug is introduced in [SPARK-9092](https://issues.apache.org/jira/browse/SPARK-9092), `targetExecutorNumber` should use `minExecutors` if `initialExecutors` is not set. Using 0 instead will meet the problem as mentioned in [SPARK-10790](https://issues.apache.org/jira/browse/SPARK-10790). Also consolidate and simplify some similar code snippets to keep the consistent semantics. Author: jerryshao Closes #8910 from jerryshao/SPARK-10790. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/353c30bd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/353c30bd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/353c30bd Branch: refs/heads/master Commit: 353c30bd7dfbd3b76fc8bc9a6dfab9321439a34b Parents: d8d50ed Author: jerryshao Authored: Mon Sep 28 06:38:54 2015 -0700 Committer: Marcelo Vanzin Committed: Mon Sep 28 06:38:54 2015 -0700 -- .../spark/deploy/yarn/ClientArguments.scala | 20 + .../spark/deploy/yarn/YarnAllocator.scala | 6 + .../spark/deploy/yarn/YarnSparkHadoopUtil.scala | 23 .../cluster/YarnClusterSchedulerBackend.scala | 18 ++- 4 files changed, 27 insertions(+), 40 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/353c30bd/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala index 54f62e6..1165061 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala @@ -81,25 +81,7 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf) .orNull // If dynamic allocation is enabled, start at the configured initial number of executors. // Default to minExecutors if no initialExecutors is set. 
-if (isDynamicAllocationEnabled) { - val minExecutorsConf = "spark.dynamicAllocation.minExecutors" - val initialExecutorsConf = "spark.dynamicAllocation.initialExecutors" - val maxExecutorsConf = "spark.dynamicAllocation.maxExecutors" - val minNumExecutors = sparkConf.getInt(minExecutorsConf, 0) - val initialNumExecutors = sparkConf.getInt(initialExecutorsConf, minNumExecutors) - val maxNumExecutors = sparkConf.getInt(maxExecutorsConf, Integer.MAX_VALUE) - - // If defined, initial executors must be between min and max - if (initialNumExecutors < minNumExecutors || initialNumExecutors > maxNumExecutors) { -throw new IllegalArgumentException( - s"$initialExecutorsConf must be between $minExecutorsConf and $maxNumExecutors!") - } - - numExecutors = initialNumExecutors -} else { - val numExecutorsConf = "spark.executor.instances" - numExecutors = sparkConf.getInt(numExecutorsConf, numExecutors) -} +numExecutors = YarnSparkHadoopUtil.getInitialTargetExecutorNumber(sparkConf) principal = Option(principal) .orElse(sparkConf.getOption("spark.yarn.principal")) .orNull http://git-wip-us.apache.org/repos/asf/spark/blob/353c30bd/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala index fd88b8b..9e1ef1b 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala @@ -89,11 +89,7 @@ private[yarn] class YarnAllocator( @volatile private var numExecutorsFailed = 0 @volatile private var targetNumExecutors = -if (Utils.isDynamicAllocationEnabled(sparkConf)) { - sparkConf.getInt("spark.dynamicAllocation.initialExecutors", 0) -} else { - sparkConf.getInt("spark.executor.instances", YarnSparkHadoopUtil.DEFAULT_NUMBER_EXECUTORS) -} +YarnSparkHadoopUtil.getInitialTargetExecutorNumber(sparkConf) // Executor loss reason requests that are pending - maps from executor ID for inquiry to a // list of requesters that should be responded to once we find out why the given executor http://git-wip-us.apache.org/repos/asf/spark/blob/353c30bd/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala -- diff --git
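To make the consolidated semantics concrete, here is a hedged Scala sketch of the logic both call sites now share: with dynamic allocation enabled, the initial target defaults to `spark.dynamicAllocation.minExecutors` (not 0) when `spark.dynamicAllocation.initialExecutors` is unset, and otherwise `spark.executor.instances` applies. It mirrors the behavior shown in the diff but is not necessarily the exact body of `YarnSparkHadoopUtil.getInitialTargetExecutorNumber`.

```scala
import org.apache.spark.SparkConf

object InitialExecutorsSketch {
  val DefaultNumberExecutors = 2  // assumed default, standing in for YarnSparkHadoopUtil.DEFAULT_NUMBER_EXECUTORS

  def initialTargetExecutors(conf: SparkConf, numExecutors: Int = DefaultNumberExecutors): Int =
    if (conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
      val minNumExecutors = conf.getInt("spark.dynamicAllocation.minExecutors", 0)
      // Key point of the fix: default to minExecutors, not 0.
      val initialNumExecutors =
        conf.getInt("spark.dynamicAllocation.initialExecutors", minNumExecutors)
      val maxNumExecutors = conf.getInt("spark.dynamicAllocation.maxExecutors", Int.MaxValue)
      require(initialNumExecutors >= minNumExecutors && initialNumExecutors <= maxNumExecutors,
        s"initialExecutors ($initialNumExecutors) must be between minExecutors and maxExecutors")
      initialNumExecutors
    } else {
      conf.getInt("spark.executor.instances", numExecutors)
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(false)
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "3")
    println(initialTargetExecutors(conf)) // 3, where the old code would have started the target at 0
  }
}
```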
spark git commit: [SPARK-10812] [YARN] Spark hadoop util support switching to yarn
Repository: spark Updated Branches: refs/heads/master b58249930 -> d8d50ed38 [SPARK-10812] [YARN] Spark hadoop util support switching to yarn While this is likely not a huge issue for real production systems, for test systems which may set up a Spark Context, tear it down, and stand up a Spark Context with a different master (e.g. some local mode & some yarn mode), this can be an issue. Discovered during work on spark-testing-base on Spark 1.4.1, but it seems like the logic that triggers it is present in master (see the SparkHadoopUtil object). A valid workaround for users encountering this issue is to fork a different JVM, however this can be heavyweight. ``` [info] SampleMiniClusterTest: [info] Exception encountered when attempting to run a suite with class name: com.holdenkarau.spark.testing.SampleMiniClusterTest *** ABORTED *** [info] java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil [info] at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:163) [info] at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:257) [info] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561) [info] at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115) [info] at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57) [info] at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141) [info] at org.apache.spark.SparkContext.(SparkContext.scala:497) [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.setup(SharedMiniCluster.scala:186) [info] at com.holdenkarau.spark.testing.SampleMiniClusterTest.setup(SampleMiniClusterTest.scala:26) [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.beforeAll(SharedMiniCluster.scala:103) ``` Author: Holden Karau Closes #8911 from holdenk/SPARK-10812-spark-hadoop-util-support-switching-to-yarn. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d8d50ed3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d8d50ed3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d8d50ed3 Branch: refs/heads/master Commit: d8d50ed388d2e695b69d2b93a620045ef2f0bc18 Parents: b582499 Author: Holden Karau Authored: Mon Sep 28 06:33:45 2015 -0700 Committer: Marcelo Vanzin Committed: Mon Sep 28 06:33:45 2015 -0700 -- .../scala/org/apache/spark/SparkContext.scala | 2 ++ .../apache/spark/deploy/SparkHadoopUtil.scala | 30 ++-- .../org/apache/spark/deploy/yarn/Client.scala | 6 +++- .../deploy/yarn/YarnSparkHadoopUtilSuite.scala | 12 4 files changed, 34 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d8d50ed3/core/src/main/scala/org/apache/spark/SparkContext.scala -- diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index bf3aeb4..0c72adf 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -1756,6 +1756,8 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli } SparkEnv.set(null) } +// Unset YARN mode system env variable, to allow switching between cluster types.
+System.clearProperty("SPARK_YARN_MODE") SparkContext.clearActiveContext() logInfo("Successfully stopped SparkContext") } http://git-wip-us.apache.org/repos/asf/spark/blob/d8d50ed3/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala index a0b7365..d606b80 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala @@ -385,20 +385,13 @@ class SparkHadoopUtil extends Logging { object SparkHadoopUtil { - private val hadoop = { -val yarnMode = java.lang.Boolean.valueOf( -System.getProperty("SPARK_YARN_MODE", System.getenv("SPARK_YARN_MODE"))) -if (yarnMode) { - try { -Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil") - .newInstance() - .asInstanceOf[SparkHadoopUtil] - } catch { - case e: Exception => throw new SparkException("Unable to load YARN support", e) - } -} else { - new SparkHadoopUtil -} + private lazy val hadoop = new SparkHadoopUtil + priva
spark git commit: Fix two mistakes in programming-guide page
Repository: spark Updated Branches: refs/heads/master fb4c7be74 -> b58249930 Fix two mistakes in programming-guide page seperate -> separate sees -> see Author: David Martin Closes #8928 from dmartinpro/patch-1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b5824993 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b5824993 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b5824993 Branch: refs/heads/master Commit: b58249930d58e2de238c05aaf5fa9315b4c3cbab Parents: fb4c7be Author: David Martin Authored: Mon Sep 28 10:41:39 2015 +0100 Committer: Sean Owen Committed: Mon Sep 28 10:41:39 2015 +0100 -- docs/programming-guide.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b5824993/docs/programming-guide.md -- diff --git a/docs/programming-guide.md b/docs/programming-guide.md index 8ad2383..22656fd 100644 --- a/docs/programming-guide.md +++ b/docs/programming-guide.md @@ -805,9 +805,9 @@ print("Counter value: " + counter) The primary challenge is that the behavior of the above code is undefined. In local mode with a single JVM, the above code will sum the values within the RDD and store it in **counter**. This is because both the RDD and the variable **counter** are in the same memory space on the driver node. -However, in `cluster` mode, what happens is more complicated, and the above may not work as intended. To execute jobs, Spark breaks up the processing of RDD operations into tasks - each of which is operated on by an executor. Prior to execution, Spark computes the **closure**. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case `foreach()`). This closure is serialized and sent to each executor. In `local` mode, there is only the one executors so everything shares the same closure. In other modes however, this is not the case and the executors running on seperate worker nodes each have their own copy of the closure. +However, in `cluster` mode, what happens is more complicated, and the above may not work as intended. To execute jobs, Spark breaks up the processing of RDD operations into tasks - each of which is operated on by an executor. Prior to execution, Spark computes the **closure**. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case `foreach()`). This closure is serialized and sent to each executor. In `local` mode, there is only the one executors so everything shares the same closure. In other modes however, this is not the case and the executors running on separate worker nodes each have their own copy of the closure. -What is happening here is that the variables within the closure sent to each executor are now copies and thus, when **counter** is referenced within the `foreach` function, it's no longer the **counter** on the driver node. There is still a **counter** in the memory of the driver node but this is no longer visible to the executors! The executors only sees the copy from the serialized closure. Thus, the final value of **counter** will still be zero since all operations on **counter** were referencing the value within the serialized closure. 
+What is happening here is that the variables within the closure sent to each executor are now copies and thus, when **counter** is referenced within the `foreach` function, it's no longer the **counter** on the driver node. There is still a **counter** in the memory of the driver node but this is no longer visible to the executors! The executors only see the copy from the serialized closure. Thus, the final value of **counter** will still be zero since all operations on **counter** were referencing the value within the serialized closure. To ensure well-defined behavior in these sorts of scenarios one should use an [`Accumulator`](#AccumLink). Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster. The Accumulators section of this guide discusses these in more detail.
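Here is a small, self-contained example of the accumulator-based approach that paragraph recommends, using the Spark 1.x accumulator API and a local master purely for illustration; the plain-variable version is included only for contrast, and its result is mode-dependent and undefined, exactly as the guide warns.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CounterWithAccumulator {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CounterWithAccumulator").setMaster("local[2]"))
    val data = sc.parallelize(1 to 100)

    // Undefined: each task updates a deserialized copy of `counter` from its
    // closure, so the driver-side value may not reflect the sum at all.
    var counter = 0
    data.foreach(x => counter += x)
    println("plain variable: " + counter)

    // Well-defined: accumulators exist precisely for updates made on executors.
    val sum = sc.accumulator(0, "sum")
    data.foreach(x => sum += x)
    println("accumulator: " + sum.value) // 5050

    sc.stop()
  }
}
```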