spark git commit: [SPARK-17525][PYTHON] Remove SparkContext.clearFiles() from the PySpark API as it was removed from the Scala API prior to Spark 2.0.0

2016-09-14 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master def7c265f -> b5bfcddbf


[SPARK-17525][PYTHON] Remove SparkContext.clearFiles() from the PySpark API as 
it was removed from the Scala API prior to Spark 2.0.0

## What changes were proposed in this pull request?

This pull request removes the SparkContext.clearFiles() method from the PySpark 
API as the method was removed from the Scala API in 
8ce645d4eeda203cf5e100c4bdba2d71edd44e6a. Using that method in PySpark leads to 
an exception as PySpark tries to call the non-existent method on the JVM side.

## How was this patch tested?

Existing tests (though none of them tested this particular method).

Author: Sami Jaktholm 

Closes #15081 from sjakthol/pyspark-sc-clearfiles.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b5bfcddb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b5bfcddb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b5bfcddb

Branch: refs/heads/master
Commit: b5bfcddbfbc2e79d3d0fbd43942716946e6c4ba3
Parents: def7c26
Author: Sami Jaktholm 
Authored: Wed Sep 14 09:38:30 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 14 09:38:30 2016 +0100

--
 python/pyspark/context.py | 8 
 1 file changed, 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b5bfcddb/python/pyspark/context.py
--
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 6e9f24e..2744bb9 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -787,14 +787,6 @@ class SparkContext(object):
 """
 self._jsc.sc().addFile(path)
 
-def clearFiles(self):
-"""
-Clear the job's list of files added by L{addFile} or L{addPyFile} so
-that they do not get downloaded to any new nodes.
-"""
-# TODO: remove added .py or .zip files from the PYTHONPATH?
-self._jsc.sc().clearFiles()
-
 def addPyFile(self, path):
 """
 Add a .py or .zip dependency for all tasks to be executed on this


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17525][PYTHON] Remove SparkContext.clearFiles() from the PySpark API as it was removed from the Scala API prior to Spark 2.0.0

2016-09-14 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 c1426452b -> 12ebfbedd


[SPARK-17525][PYTHON] Remove SparkContext.clearFiles() from the PySpark API as 
it was removed from the Scala API prior to Spark 2.0.0

## What changes were proposed in this pull request?

This pull request removes the SparkContext.clearFiles() method from the PySpark 
API as the method was removed from the Scala API in 
8ce645d4eeda203cf5e100c4bdba2d71edd44e6a. Using that method in PySpark leads to 
an exception as PySpark tries to call the non-existent method on the JVM side.

## How was this patch tested?

Existing tests (though none of them tested this particular method).

Author: Sami Jaktholm 

Closes #15081 from sjakthol/pyspark-sc-clearfiles.

(cherry picked from commit b5bfcddbfbc2e79d3d0fbd43942716946e6c4ba3)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/12ebfbed
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/12ebfbed
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/12ebfbed

Branch: refs/heads/branch-2.0
Commit: 12ebfbeddf057efb666a7b6365c948c3fe479f2c
Parents: c142645
Author: Sami Jaktholm 
Authored: Wed Sep 14 09:38:30 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 14 09:38:39 2016 +0100

--
 python/pyspark/context.py | 8 
 1 file changed, 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/12ebfbed/python/pyspark/context.py
--
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 6e9f24e..2744bb9 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -787,14 +787,6 @@ class SparkContext(object):
 """
 self._jsc.sc().addFile(path)
 
-def clearFiles(self):
-"""
-Clear the job's list of files added by L{addFile} or L{addPyFile} so
-that they do not get downloaded to any new nodes.
-"""
-# TODO: remove added .py or .zip files from the PYTHONPATH?
-self._jsc.sc().clearFiles()
-
 def addPyFile(self, path):
 """
 Add a .py or .zip dependency for all tasks to be executed on this


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [CORE][DOC] remove redundant comment

2016-09-14 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master b5bfcddbf -> 18b4f035f


[CORE][DOC] remove redundant comment

## What changes were proposed in this pull request?
In the comment, the phrase `the estimated` is duplicated.

This PR simply removes the duplicated phrase and adjusts the formatting.

Author: wm...@hotmail.com 

Closes #15091 from wangmiao1981/comment.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/18b4f035
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/18b4f035
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/18b4f035

Branch: refs/heads/master
Commit: 18b4f035f40359b3164456d0dab52dbc762ea3b4
Parents: b5bfcdd
Author: wm...@hotmail.com 
Authored: Wed Sep 14 09:49:15 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 14 09:49:15 2016 +0100

--
 .../apache/spark/storage/memory/MemoryStore.scala | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/18b4f035/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala 
b/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala
index 1a3bf2b..baa3fde 100644
--- a/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala
+++ b/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala
@@ -169,12 +169,12 @@ private[spark] class MemoryStore(
* temporary unroll memory used during the materialization is "transferred" 
to storage memory,
* so we won't acquire more memory than is actually needed to store the 
block.
*
-   * @return in case of success, the estimated the estimated size of the 
stored data. In case of
-   * failure, return an iterator containing the values of the block. 
The returned iterator
-   * will be backed by the combination of the partially-unrolled block 
and the remaining
-   * elements of the original input iterator. The caller must either 
fully consume this
-   * iterator or call `close()` on it in order to free the storage 
memory consumed by the
-   * partially-unrolled block.
+   * @return in case of success, the estimated size of the stored data. In 
case of failure, return
+   * an iterator containing the values of the block. The returned 
iterator will be backed
+   * by the combination of the partially-unrolled block and the 
remaining elements of the
+   * original input iterator. The caller must either fully consume 
this iterator or call
+   * `close()` on it in order to free the storage memory consumed by 
the partially-unrolled
+   * block.
*/
   private[storage] def putIteratorAsValues[T](
   blockId: BlockId,
@@ -298,9 +298,9 @@ private[spark] class MemoryStore(
* temporary unroll memory used during the materialization is "transferred" 
to storage memory,
* so we won't acquire more memory than is actually needed to store the 
block.
*
-   * @return in case of success, the estimated the estimated size of the 
stored data. In case of
-   * failure, return a handle which allows the caller to either finish 
the serialization
-   * by spilling to disk or to deserialize the partially-serialized 
block and reconstruct
+   * @return in case of success, the estimated size of the stored data. In 
case of failure,
+   * return a handle which allows the caller to either finish the 
serialization by
+   * spilling to disk or to deserialize the partially-serialized block 
and reconstruct
* the original input iterator. The caller must either fully consume 
this result
* iterator or call `discard()` on it in order to free the storage 
memory consumed by the
* partially-unrolled block.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17480][SQL] Improve performance by removing or caching List.length which is O(n)

2016-09-14 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 18b4f035f -> 4cea9da2a


[SPARK-17480][SQL] Improve performance by removing or caching List.length which 
is O(n)

## What changes were proposed in this pull request?
Scala's List.length method is O(N), which makes the gatherCompressibilityStats
function O(N^2). Eliminate the repeated List.length calls by rewriting the loop
in idiomatic Scala.

https://github.com/scala/scala/blob/2.10.x/src/library/scala/collection/LinearSeqOptimized.scala#L36

As suggested, the fix was also extended to the HiveInspectors and
AggregationIterator classes.
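
To illustrate (a standalone sketch, not the Spark code itself): using
List.length as the bound of an indexed loop re-traverses the list on every
iteration, while a single traversal avoids the quadratic cost.

```
// Quadratic: xs.length is O(n) on a List and is re-evaluated each iteration,
// and xs(i) is itself O(i), so the loop degrades to roughly O(n^2).
def sumQuadratic(xs: List[Int]): Int = {
  var total = 0
  var i = 0
  while (i < xs.length) {
    total += xs(i)
    i += 1
  }
  total
}

// Linear: cache the length once if an index is needed, or simply traverse.
def sumLinear(xs: List[Int]): Int = {
  var total = 0
  xs.foreach(total += _)   // single pass, no repeated length calls
  total
}
```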

## How was this patch tested?
Profiled a Spark job and found that CompressibleColumnBuilder is using 39% of
the CPU. Out of this 39%, CompressibleColumnBuilder->gatherCompressibilityStats
is using 23%, and 6.24% of the CPU is spent on List.length, which is called
inside gatherCompressibilityStats.

After this change, that 6.24% of CPU time is saved.

Author: Ergin Seyfe 

Closes #15032 from seyfe/gatherCompressibilityStats.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4cea9da2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4cea9da2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4cea9da2

Branch: refs/heads/master
Commit: 4cea9da2ae88b40a5503111f8f37051e2372163e
Parents: 18b4f03
Author: Ergin Seyfe 
Authored: Wed Sep 14 09:51:14 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 14 09:51:14 2016 +0100

--
 .../spark/sql/execution/aggregate/AggregationIterator.scala   | 7 ---
 .../columnar/compression/CompressibleColumnBuilder.scala  | 6 +-
 .../main/scala/org/apache/spark/sql/hive/HiveInspectors.scala | 6 --
 3 files changed, 9 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4cea9da2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala
index dfed084..f335912 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala
@@ -73,9 +73,10 @@ abstract class AggregationIterator(
   startingInputBufferOffset: Int): Array[AggregateFunction] = {
 var mutableBufferOffset = 0
 var inputBufferOffset: Int = startingInputBufferOffset
-val functions = new Array[AggregateFunction](expressions.length)
+val expressionsLength = expressions.length
+val functions = new Array[AggregateFunction](expressionsLength)
 var i = 0
-while (i < expressions.length) {
+while (i < expressionsLength) {
   val func = expressions(i).aggregateFunction
   val funcWithBoundReferences: AggregateFunction = expressions(i).mode 
match {
 case Partial | Complete if func.isInstanceOf[ImperativeAggregate] =>
@@ -171,7 +172,7 @@ abstract class AggregationIterator(
 case PartialMerge | Final =>
   (buffer: MutableRow, row: InternalRow) => ae.merge(buffer, row)
   }
-  }
+  }.toArray
   // This projection is used to merge buffer values for all 
expression-based aggregates.
   val aggregationBufferSchema = functions.flatMap(_.aggBufferAttributes)
   val updateProjection =

http://git-wip-us.apache.org/repos/asf/spark/blob/4cea9da2/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressibleColumnBuilder.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressibleColumnBuilder.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressibleColumnBuilder.scala
index 63eae1b..0f4680e 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressibleColumnBuilder.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressibleColumnBuilder.scala
@@ -66,11 +66,7 @@ private[columnar] trait CompressibleColumnBuilder[T <: 
AtomicType]
   }
 
   private def gatherCompressibilityStats(row: InternalRow, ordinal: Int): Unit 
= {
-var i = 0
-while (i < compressionEncoders.length) {
-  compressionEncoders(i).gatherCompressibilityStats(row, ordinal)
-  i += 1
-}
+compressionEncoders.foreach(_.gatherCompressibilityStats(row, ordinal))
   }
 
   abstract override def appendFrom(row: InternalRow, ordinal: Int): Unit = {

http://git-wip-us.apache.org/repos/asf/spark/blob/4cea9da2/sql/hive/src/main/scala/org/apa

spark git commit: [SPARK-17480][SQL] Improve performance by removing or caching List.length which is O(n)

2016-09-14 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 12ebfbedd -> c6ea748a7


[SPARK-17480][SQL] Improve performance by removing or caching List.length which 
is O(n)

## What changes were proposed in this pull request?
Scala's List.length method is O(N), which makes the gatherCompressibilityStats
function O(N^2). Eliminate the repeated List.length calls by rewriting the loop
in idiomatic Scala.

https://github.com/scala/scala/blob/2.10.x/src/library/scala/collection/LinearSeqOptimized.scala#L36

As suggested, the fix was also extended to the HiveInspectors and
AggregationIterator classes.

## How was this patch tested?
Profiled a Spark job and found that CompressibleColumnBuilder is using 39% of
the CPU. Out of this 39%, CompressibleColumnBuilder->gatherCompressibilityStats
is using 23%, and 6.24% of the CPU is spent on List.length, which is called
inside gatherCompressibilityStats.

After this change, that 6.24% of CPU time is saved.

Author: Ergin Seyfe 

Closes #15032 from seyfe/gatherCompressibilityStats.

(cherry picked from commit 4cea9da2ae88b40a5503111f8f37051e2372163e)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c6ea748a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c6ea748a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c6ea748a

Branch: refs/heads/branch-2.0
Commit: c6ea748a7e0baec222cbb4bd130673233adc5e0c
Parents: 12ebfbe
Author: Ergin Seyfe 
Authored: Wed Sep 14 09:51:14 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 14 09:51:22 2016 +0100

--
 .../spark/sql/execution/aggregate/AggregationIterator.scala   | 7 ---
 .../columnar/compression/CompressibleColumnBuilder.scala  | 6 +-
 .../main/scala/org/apache/spark/sql/hive/HiveInspectors.scala | 6 --
 3 files changed, 9 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c6ea748a/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala
index 34de76d..6ca36e4 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala
@@ -73,9 +73,10 @@ abstract class AggregationIterator(
   startingInputBufferOffset: Int): Array[AggregateFunction] = {
 var mutableBufferOffset = 0
 var inputBufferOffset: Int = startingInputBufferOffset
-val functions = new Array[AggregateFunction](expressions.length)
+val expressionsLength = expressions.length
+val functions = new Array[AggregateFunction](expressionsLength)
 var i = 0
-while (i < expressions.length) {
+while (i < expressionsLength) {
   val func = expressions(i).aggregateFunction
   val funcWithBoundReferences: AggregateFunction = expressions(i).mode 
match {
 case Partial | Complete if func.isInstanceOf[ImperativeAggregate] =>
@@ -171,7 +172,7 @@ abstract class AggregationIterator(
 case PartialMerge | Final =>
   (buffer: MutableRow, row: InternalRow) => ae.merge(buffer, row)
   }
-  }
+  }.toArray
   // This projection is used to merge buffer values for all 
expression-based aggregates.
   val aggregationBufferSchema = functions.flatMap(_.aggBufferAttributes)
   val updateProjection =

http://git-wip-us.apache.org/repos/asf/spark/blob/c6ea748a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressibleColumnBuilder.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressibleColumnBuilder.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressibleColumnBuilder.scala
index 63eae1b..0f4680e 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressibleColumnBuilder.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressibleColumnBuilder.scala
@@ -66,11 +66,7 @@ private[columnar] trait CompressibleColumnBuilder[T <: 
AtomicType]
   }
 
   private def gatherCompressibilityStats(row: InternalRow, ordinal: Int): Unit 
= {
-var i = 0
-while (i < compressionEncoders.length) {
-  compressionEncoders(i).gatherCompressibilityStats(row, ordinal)
-  i += 1
-}
+compressionEncoders.foreach(_.gatherCompressibilityStats(row, ordinal))
   }
 
   abstract override def appendFrom(row: InternalRow, ordinal: In

[2/2] spark-website git commit: Redirect third party packages link to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects

2016-09-14 Thread srowen
Redirect third party packages link to 
https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/a78faf58
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/a78faf58
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/a78faf58

Branch: refs/heads/asf-site
Commit: a78faf5822bca343694776ea3ec8457fa780f09f
Parents: 0845f49
Author: Sean Owen 
Authored: Tue Sep 13 10:11:27 2016 +0100
Committer: Sean Owen 
Committed: Tue Sep 13 10:11:27 2016 +0100

--
 _layouts/global.html| 4 ++--
 site/community.html | 4 ++--
 site/documentation.html | 4 ++--
 site/downloads.html | 4 ++--
 site/examples.html  | 4 ++--
 site/faq.html   | 4 ++--
 site/graphx/index.html  | 4 ++--
 site/index.html | 4 ++--
 site/mailing-lists.html | 4 ++--
 site/mllib/index.html   | 4 ++--
 site/news/amp-camp-2013-registration-ope.html   | 4 ++--
 site/news/announcing-the-first-spark-summit.html| 4 ++--
 site/news/fourth-spark-screencast-published.html| 4 ++--
 site/news/index.html| 4 ++--
 site/news/nsdi-paper.html   | 4 ++--
 site/news/one-month-to-spark-summit-2015.html   | 4 ++--
 site/news/proposals-open-for-spark-summit-east.html | 4 ++--
 site/news/registration-open-for-spark-summit-east.html  | 4 ++--
 site/news/run-spark-and-shark-on-amazon-emr.html| 4 ++--
 site/news/spark-0-6-1-and-0-5-2-released.html   | 4 ++--
 site/news/spark-0-6-2-released.html | 4 ++--
 site/news/spark-0-7-0-released.html | 4 ++--
 site/news/spark-0-7-2-released.html | 4 ++--
 site/news/spark-0-7-3-released.html | 4 ++--
 site/news/spark-0-8-0-released.html | 4 ++--
 site/news/spark-0-8-1-released.html | 4 ++--
 site/news/spark-0-9-0-released.html | 4 ++--
 site/news/spark-0-9-1-released.html | 4 ++--
 site/news/spark-0-9-2-released.html | 4 ++--
 site/news/spark-1-0-0-released.html | 4 ++--
 site/news/spark-1-0-1-released.html | 4 ++--
 site/news/spark-1-0-2-released.html | 4 ++--
 site/news/spark-1-1-0-released.html | 4 ++--
 site/news/spark-1-1-1-released.html | 4 ++--
 site/news/spark-1-2-0-released.html | 4 ++--
 site/news/spark-1-2-1-released.html | 4 ++--
 site/news/spark-1-2-2-released.html | 4 ++--
 site/news/spark-1-3-0-released.html | 4 ++--
 site/news/spark-1-4-0-released.html | 4 ++--
 site/news/spark-1-4-1-released.html | 4 ++--
 site/news/spark-1-5-0-released.html | 4 ++--
 site/news/spark-1-5-1-released.html | 4 ++--
 site/news/spark-1-5-2-released.html | 4 ++--
 site/news/spark-1-6-0-released.html | 4 ++--
 site/news/spark-1-6-1-released.html | 4 ++--
 site/news/spark-1-6-2-released.html | 4 ++--
 site/news/spark-2-0-0-released.html | 4 ++--
 site/news/spark-2.0.0-preview.html  | 4 ++--
 site/news/spark-accepted-into-apache-incubator.html | 4 ++--
 site/news/spark-and-shark-in-the-news.html  | 4 ++--
 site/news/spark-becomes-tlp.html| 4 ++--
 site/news/spark-featured-in-wired.html  | 4 ++--
 site/news/spark-mailing-lists-moving-to-apache.html | 4 ++--
 site/news/spark-meetups.html| 4 ++--
 site/news/spark-screencasts-published.html  | 4 ++--
 site/news/spark-summit-2013-is-a-wrap.html  | 4 ++--
 site/news/spark-summit-2014-videos-posted.html  | 4 ++--
 site/news/spark-summit-2015-videos-posted.html  | 4 ++--
 site/news/spark-summit-agenda-posted.html   | 4 ++--
 site/news/spark-summit-east-2015-videos-posted.html | 4 ++--
 site/news/spark-summit-east-2016-cfp-closing.html   | 4 ++--
 site/news/spark-summit-east-agenda-

[1/2] spark-website git commit: Redirect third party packages link to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects

2016-09-14 Thread srowen
Repository: spark-website
Updated Branches:
  refs/heads/asf-site 0845f49de -> a78faf582


http://git-wip-us.apache.org/repos/asf/spark-website/blob/a78faf58/site/news/spark-wins-daytona-gray-sort-100tb-benchmark.html
--
diff --git a/site/news/spark-wins-daytona-gray-sort-100tb-benchmark.html 
b/site/news/spark-wins-daytona-gray-sort-100tb-benchmark.html
index 0b83cab..b168c6c 100644
--- a/site/news/spark-wins-daytona-gray-sort-100tb-benchmark.html
+++ b/site/news/spark-wins-daytona-gray-sort-100tb-benchmark.html
@@ -98,7 +98,7 @@
   MLlib (machine learning)
   GraphX (graph)
   
-  http://spark-packages.org";>Third-Party Packages
+  https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects";>Third-Party
 Packages
 
   
   
@@ -178,7 +178,7 @@
 MLlib (machine learning)
 GraphX (graph)
   
-  http://spark-packages.org";>Third-Party Packages
+  https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects";>Third-Party
 Packages
 
   
 

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a78faf58/site/news/strata-exercises-now-available-online.html
--
diff --git a/site/news/strata-exercises-now-available-online.html 
b/site/news/strata-exercises-now-available-online.html
index 1b7fd25..fec18a0 100644
--- a/site/news/strata-exercises-now-available-online.html
+++ b/site/news/strata-exercises-now-available-online.html
@@ -98,7 +98,7 @@
   MLlib (machine learning)
   GraphX (graph)
   
-  http://spark-packages.org";>Third-Party Packages
+  https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects";>Third-Party
 Packages
 
   
   
@@ -178,7 +178,7 @@
 MLlib (machine learning)
 GraphX (graph)
   
-  http://spark-packages.org";>Third-Party Packages
+  https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects";>Third-Party
 Packages
 
   
 

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a78faf58/site/news/submit-talks-to-spark-summit-2014.html
--
diff --git a/site/news/submit-talks-to-spark-summit-2014.html 
b/site/news/submit-talks-to-spark-summit-2014.html
index bc9b2e7..ffe0dc2 100644
--- a/site/news/submit-talks-to-spark-summit-2014.html
+++ b/site/news/submit-talks-to-spark-summit-2014.html
@@ -98,7 +98,7 @@
   MLlib (machine learning)
   GraphX (graph)
   
-  http://spark-packages.org";>Third-Party Packages
+  https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects";>Third-Party
 Packages
 
   
   
@@ -178,7 +178,7 @@
 MLlib (machine learning)
 GraphX (graph)
   
-  http://spark-packages.org";>Third-Party Packages
+  https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects";>Third-Party
 Packages
 
   
 

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a78faf58/site/news/submit-talks-to-spark-summit-2016.html
--
diff --git a/site/news/submit-talks-to-spark-summit-2016.html 
b/site/news/submit-talks-to-spark-summit-2016.html
index fcee041..f4c1cb6 100644
--- a/site/news/submit-talks-to-spark-summit-2016.html
+++ b/site/news/submit-talks-to-spark-summit-2016.html
@@ -98,7 +98,7 @@
   MLlib (machine learning)
   GraphX (graph)
   
-  http://spark-packages.org";>Third-Party Packages
+  https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects";>Third-Party
 Packages
 
   
   
@@ -178,7 +178,7 @@
 MLlib (machine learning)
 GraphX (graph)
   
-  http://spark-packages.org";>Third-Party Packages
+  https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects";>Third-Party
 Packages
 
   
 

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a78faf58/site/news/submit-talks-to-spark-summit-east-2016.html
--
diff --git a/site/news/submit-talks-to-spark-summit-east-2016.html 
b/site/news/submit-talks-to-spark-summit-east-2016.html
index a5ad0d8..4858b9d 100644
--- a/site/news/submit-talks-to-spark-summit-east-2016.html
+++ b/site/news/submit-talks-to-spark-summit-east-2016.html
@@ -98,7 +98,7 @@
   MLlib (machine learning)
   GraphX (graph)
   
-  http://spark-packages.org";>Third-Party Packages
+  https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects";>Third-Party
 Packages
 
   
   
@@ -178,7 +178,7 @@
 MLlib (machine learning)
 GraphX (graph)
   
-  http://spark-packages.org";>Third-Party Packages
+  https:

spark git commit: [SPARK-17445][DOCS] Reference an ASF page as the main place to find third-party packages

2016-09-14 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 4cea9da2a -> dc0a4c916


[SPARK-17445][DOCS] Reference an ASF page as the main place to find third-party 
packages

## What changes were proposed in this pull request?

Point references to spark-packages.org to 
https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects

This will be accompanied by a parallel change to the spark-website repo, and 
additional changes to this wiki.

## How was this patch tested?

Jenkins tests.

Author: Sean Owen 

Closes #15075 from srowen/SPARK-17445.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dc0a4c91
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dc0a4c91
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dc0a4c91

Branch: refs/heads/master
Commit: dc0a4c916151c795dc41b5714e9d23b4937f4636
Parents: 4cea9da
Author: Sean Owen 
Authored: Wed Sep 14 10:10:16 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 14 10:10:16 2016 +0100

--
 CONTRIBUTING.md | 2 +-
 R/pkg/R/sparkR.R| 4 ++--
 docs/_layouts/global.html   | 2 +-
 docs/index.md   | 2 +-
 docs/sparkr.md  | 3 ++-
 docs/streaming-programming-guide.md | 2 +-
 .../apache/spark/sql/execution/datasources/DataSource.scala | 7 ---
 .../src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala | 9 +++--
 .../apache/spark/sql/sources/ResolvedDataSourceSuite.scala  | 6 +++---
 9 files changed, 18 insertions(+), 19 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/dc0a4c91/CONTRIBUTING.md
--
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index f10d7e2..1a8206a 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -6,7 +6,7 @@ It lists steps that are required before creating a PR. In 
particular, consider:
 
 - Is the change important and ready enough to ask the community to spend time 
reviewing?
 - Have you searched for existing, related JIRAs and pull requests?
-- Is this a new feature that can stand alone as a package on 
http://spark-packages.org ?
+- Is this a new feature that can stand alone as a [third party 
project](https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects)
 ?
 - Is the change being proposed clearly explained and motivated?
 
 When you contribute code, you affirm that the contribution is your original 
work and that you 

http://git-wip-us.apache.org/repos/asf/spark/blob/dc0a4c91/R/pkg/R/sparkR.R
--
diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index 15afe01..0601536 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -100,7 +100,7 @@ sparkR.stop <- function() {
 #' @param sparkEnvir Named list of environment variables to set on worker nodes
 #' @param sparkExecutorEnv Named list of environment variables to be used when 
launching executors
 #' @param sparkJars Character vector of jar files to pass to the worker nodes
-#' @param sparkPackages Character vector of packages from spark-packages.org
+#' @param sparkPackages Character vector of package coordinates
 #' @seealso \link{sparkR.session}
 #' @rdname sparkR.init-deprecated
 #' @export
@@ -327,7 +327,7 @@ sparkRHive.init <- function(jsc = NULL) {
 #' @param sparkHome Spark Home directory.
 #' @param sparkConfig named list of Spark configuration to set on worker nodes.
 #' @param sparkJars character vector of jar files to pass to the worker nodes.
-#' @param sparkPackages character vector of packages from spark-packages.org
+#' @param sparkPackages character vector of package coordinates
 #' @param enableHiveSupport enable support for Hive, fallback if not built 
with Hive support; once
 #'set, this cannot be turned off on an existing session
 #' @param ... named Spark properties passed to the method.

http://git-wip-us.apache.org/repos/asf/spark/blob/dc0a4c91/docs/_layouts/global.html
--
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index d3bf082..ad5b5c9 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -114,7 +114,7 @@
 
 Building 
Spark
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark";>Contributing
 to Spark
-https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects"

spark git commit: [SPARK-17445][DOCS] Reference an ASF page as the main place to find third-party packages

2016-09-14 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 c6ea748a7 -> 5493107d9


[SPARK-17445][DOCS] Reference an ASF page as the main place to find third-party 
packages

## What changes were proposed in this pull request?

Point references to spark-packages.org to 
https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects

This will be accompanied by a parallel change to the spark-website repo, and 
additional changes to this wiki.

## How was this patch tested?

Jenkins tests.

Author: Sean Owen 

Closes #15075 from srowen/SPARK-17445.

(cherry picked from commit dc0a4c916151c795dc41b5714e9d23b4937f4636)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5493107d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5493107d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5493107d

Branch: refs/heads/branch-2.0
Commit: 5493107d99977964cca1c15a2b0e084899e96dac
Parents: c6ea748
Author: Sean Owen 
Authored: Wed Sep 14 10:10:16 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 14 10:14:57 2016 +0100

--
 CONTRIBUTING.md | 2 +-
 R/pkg/R/sparkR.R| 4 ++--
 docs/_layouts/global.html   | 2 +-
 docs/index.md   | 2 +-
 docs/sparkr.md  | 3 ++-
 docs/streaming-programming-guide.md | 2 +-
 .../apache/spark/sql/execution/datasources/DataSource.scala | 7 ---
 .../src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala | 9 +++--
 .../apache/spark/sql/sources/ResolvedDataSourceSuite.scala  | 6 +++---
 9 files changed, 18 insertions(+), 19 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5493107d/CONTRIBUTING.md
--
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index f10d7e2..1a8206a 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -6,7 +6,7 @@ It lists steps that are required before creating a PR. In 
particular, consider:
 
 - Is the change important and ready enough to ask the community to spend time 
reviewing?
 - Have you searched for existing, related JIRAs and pull requests?
-- Is this a new feature that can stand alone as a package on 
http://spark-packages.org ?
+- Is this a new feature that can stand alone as a [third party 
project](https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects)
 ?
 - Is the change being proposed clearly explained and motivated?
 
 When you contribute code, you affirm that the contribution is your original 
work and that you 

http://git-wip-us.apache.org/repos/asf/spark/blob/5493107d/R/pkg/R/sparkR.R
--
diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index 15afe01..0601536 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -100,7 +100,7 @@ sparkR.stop <- function() {
 #' @param sparkEnvir Named list of environment variables to set on worker nodes
 #' @param sparkExecutorEnv Named list of environment variables to be used when 
launching executors
 #' @param sparkJars Character vector of jar files to pass to the worker nodes
-#' @param sparkPackages Character vector of packages from spark-packages.org
+#' @param sparkPackages Character vector of package coordinates
 #' @seealso \link{sparkR.session}
 #' @rdname sparkR.init-deprecated
 #' @export
@@ -327,7 +327,7 @@ sparkRHive.init <- function(jsc = NULL) {
 #' @param sparkHome Spark Home directory.
 #' @param sparkConfig named list of Spark configuration to set on worker nodes.
 #' @param sparkJars character vector of jar files to pass to the worker nodes.
-#' @param sparkPackages character vector of packages from spark-packages.org
+#' @param sparkPackages character vector of package coordinates
 #' @param enableHiveSupport enable support for Hive, fallback if not built 
with Hive support; once
 #'set, this cannot be turned off on an existing session
 #' @param ... named Spark properties passed to the method.

http://git-wip-us.apache.org/repos/asf/spark/blob/5493107d/docs/_layouts/global.html
--
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index d3bf082..ad5b5c9 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -114,7 +114,7 @@
 
 Building 
Spark
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark";>Contributing
 to Spark
-   

spark git commit: [SPARK-17507][ML][MLLIB] check weight vector size in ANN

2016-09-15 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 6a6adb167 -> d15b4f90e


[SPARK-17507][ML][MLLIB] check weight vector size in ANN

## What changes were proposed in this pull request?

As the TODO described, check the weight vector size and throw an exception if
it is wrong.
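
As a minimal illustrative sketch (not the Spark code itself): Scala's require
throws an IllegalArgumentException with the given message when its predicate is
false, which is the mechanism the change relies on. The helper and its
parameters below are hypothetical.

```
// layerWeightSizes stands in for the per-layer weightSize values that the
// real code sums over topology.layers.
def validateWeights(weights: Array[Double], layerWeightSizes: Seq[Int]): Unit = {
  val expected = layerWeightSizes.sum
  require(weights.length == expected,
    s"Expected weight vector of size $expected but got size ${weights.length}.")
}

// validateWeights(Array(0.0, 0.0), Seq(3)) throws:
// java.lang.IllegalArgumentException:
//   requirement failed: Expected weight vector of size 3 but got size 2.
```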

## How was this patch tested?

Existing tests.

Author: WeichenXu 

Closes #15060 from WeichenXu123/check_input_weight_size_of_ann.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d15b4f90
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d15b4f90
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d15b4f90

Branch: refs/heads/master
Commit: d15b4f90e64f7ec5cf14c7c57d2cb4234c3ce677
Parents: 6a6adb1
Author: WeichenXu 
Authored: Thu Sep 15 09:30:15 2016 +0100
Committer: Sean Owen 
Committed: Thu Sep 15 09:30:15 2016 +0100

--
 mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d15b4f90/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala 
b/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala
index 88909a9..e7e0dae 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala
@@ -545,7 +545,9 @@ private[ann] object FeedForwardModel {
* @return model
*/
   def apply(topology: FeedForwardTopology, weights: Vector): FeedForwardModel 
= {
-// TODO: check that weights size is equal to sum of layers sizes
+val expectedWeightSize = topology.layers.map(_.weightSize).sum
+require(weights.size == expectedWeightSize,
+  s"Expected weight vector of size ${expectedWeightSize} but got size 
${weights.size}.")
 new FeedForwardModel(weights, topology)
   }
 
@@ -559,11 +561,7 @@ private[ann] object FeedForwardModel {
   def apply(topology: FeedForwardTopology, seed: Long = 11L): FeedForwardModel 
= {
 val layers = topology.layers
 val layerModels = new Array[LayerModel](layers.length)
-var totalSize = 0
-for (i <- 0 until topology.layers.length) {
-  totalSize += topology.layers(i).weightSize
-}
-val weights = BDV.zeros[Double](totalSize)
+val weights = BDV.zeros[Double](topology.layers.map(_.weightSize).sum)
 var offset = 0
 val random = new XORShiftRandom(seed)
 for (i <- 0 until layers.length) {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17524][TESTS] Use specified spark.buffer.pageSize

2016-09-15 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master d15b4f90e -> f893e2625


[SPARK-17524][TESTS] Use specified spark.buffer.pageSize

## What changes were proposed in this pull request?

This PR has the appendRowUntilExceedingPageSize test in
RowBasedKeyValueBatchSuite use whatever spark.buffer.pageSize value a user has
specified, preventing a test failure for anyone testing Apache Spark on a box
with a reduced page size. The test was hardcoded to use the default page size,
which is 64 MB, so this minor PR is a test improvement.

## How was this patch tested?
Existing unit tests with 1 MB page size and with 64 MB (the default) page size

Author: Adam Roberts 

Closes #15079 from a-roberts/patch-5.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f893e262
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f893e262
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f893e262

Branch: refs/heads/master
Commit: f893e262500e2f183de88e984300dd5b085e1f71
Parents: d15b4f9
Author: Adam Roberts 
Authored: Thu Sep 15 09:37:12 2016 +0100
Committer: Sean Owen 
Committed: Thu Sep 15 09:37:12 2016 +0100

--
 .../sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java   | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f893e262/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java
--
diff --git 
a/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java
 
b/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java
index 0dd129c..fb3dbe8 100644
--- 
a/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java
+++ 
b/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java
@@ -338,15 +338,17 @@ public class RowBasedKeyValueBatchSuite {
 
   @Test
   public void appendRowUntilExceedingPageSize() throws Exception {
+// Use default size or spark.buffer.pageSize if specified
+int pageSizeToUse = (int) memoryManager.pageSizeBytes();
 RowBasedKeyValueBatch batch = RowBasedKeyValueBatch.allocate(keySchema,
-valueSchema, taskMemoryManager, 64 * 1024 * 1024); //enough 
capacity
+valueSchema, taskMemoryManager, pageSizeToUse); //enough capacity
 try {
   UnsafeRow key = makeKeyRow(1, "A");
   UnsafeRow value = makeValueRow(1, 1);
   int recordLength = 8 + key.getSizeInBytes() + value.getSizeInBytes() + 8;
   int totalSize = 4;
   int numRows = 0;
-  while (totalSize + recordLength < 64 * 1024 * 1024) { // default page 
size
+  while (totalSize + recordLength < pageSizeToUse) {
 appendRow(batch, key, value);
 totalSize += recordLength;
 numRows++;


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17521] Error when I use sparkContext.makeRDD(Seq())

2016-09-15 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 bb2bdb440 -> 5c2bc8360


[SPARK-17521] Error when I use sparkContext.makeRDD(Seq())

## What changes were proposed in this pull request?

When I use sc.makeRDD as below:
```
val data3 = sc.makeRDD(Seq())
println(data3.partitions.length)
```
I got an error:
Exception in thread "main" java.lang.IllegalArgumentException: Positive number 
of slices required

We can fix this bug by modifying the last line to do a check of seq.size:
```
  def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
assertNotStopped()
val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
new ParallelCollectionRDD[T](this, seq.map(_._1), math.max(seq.size, 
defaultParallelism), indexToPrefs)
  }
```

## How was this patch tested?

 manual tests


Author: codlife <1004910...@qq.com>
Author: codlife 

Closes #15077 from codlife/master.

(cherry picked from commit 647ee05e5815bde361662a9286ac602c44b4d4e6)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5c2bc836
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5c2bc836
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5c2bc836

Branch: refs/heads/branch-2.0
Commit: 5c2bc8360019fb08e2e62e50bb261f7ce19b231e
Parents: bb2bdb4
Author: codlife <1004910...@qq.com>
Authored: Thu Sep 15 09:38:13 2016 +0100
Committer: Sean Owen 
Committed: Thu Sep 15 09:38:22 2016 +0100

--
 core/src/main/scala/org/apache/spark/SparkContext.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5c2bc836/core/src/main/scala/org/apache/spark/SparkContext.scala
--
diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala 
b/core/src/main/scala/org/apache/spark/SparkContext.scala
index 71511b8..214758f 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -788,7 +788,7 @@ class SparkContext(config: SparkConf) extends Logging with 
ExecutorAllocationCli
   def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
 assertNotStopped()
 val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
-new ParallelCollectionRDD[T](this, seq.map(_._1), seq.size, indexToPrefs)
+new ParallelCollectionRDD[T](this, seq.map(_._1), math.max(seq.size, 1), 
indexToPrefs)
   }
 
   /**


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17521] Error when I use sparkContext.makeRDD(Seq())

2016-09-15 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master f893e2625 -> 647ee05e5


[SPARK-17521] Error when I use sparkContext.makeRDD(Seq())

## What changes were proposed in this pull request?

When I use sc.makeRDD as below:
```
val data3 = sc.makeRDD(Seq())
println(data3.partitions.length)
```
I got an error:
Exception in thread "main" java.lang.IllegalArgumentException: Positive number 
of slices required

We can fix this bug by modifying the last line to do a check of seq.size:
```
  def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
assertNotStopped()
val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
new ParallelCollectionRDD[T](this, seq.map(_._1), math.max(seq.size, 
defaultParallelism), indexToPrefs)
  }
```

## How was this patch tested?

 manual tests


Author: codlife <1004910...@qq.com>
Author: codlife 

Closes #15077 from codlife/master.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/647ee05e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/647ee05e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/647ee05e

Branch: refs/heads/master
Commit: 647ee05e5815bde361662a9286ac602c44b4d4e6
Parents: f893e26
Author: codlife <1004910...@qq.com>
Authored: Thu Sep 15 09:38:13 2016 +0100
Committer: Sean Owen 
Committed: Thu Sep 15 09:38:13 2016 +0100

--
 core/src/main/scala/org/apache/spark/SparkContext.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/647ee05e/core/src/main/scala/org/apache/spark/SparkContext.scala
--
diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala 
b/core/src/main/scala/org/apache/spark/SparkContext.scala
index e32e4aa..35b6334 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -795,7 +795,7 @@ class SparkContext(config: SparkConf) extends Logging with 
ExecutorAllocationCli
   def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
 assertNotStopped()
 val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
-new ParallelCollectionRDD[T](this, seq.map(_._1), seq.size, indexToPrefs)
+new ParallelCollectionRDD[T](this, seq.map(_._1), math.max(seq.size, 1), 
indexToPrefs)
   }
 
   /**


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17406][WEB UI] limit timeline executor events

2016-09-15 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 647ee05e5 -> ad79fc0a8


[SPARK-17406][WEB UI] limit timeline executor events

## What changes were proposed in this pull request?
The job page becomes too slow to open when there are thousands of executor
events (added or removed). I found that in the ExecutorsTab file,
executorIdToData never removes elements, so it grows over time. Before this PR,
it looks like
[timeline1.png](https://issues.apache.org/jira/secure/attachment/12827112/timeline1.png).
After this PR, it looks like
[timeline2.png](https://issues.apache.org/jira/secure/attachment/12827113/timeline2.png)
(we can set how many executor events will be displayed).
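
A minimal sketch of the general idea, assuming only the most recent events need
to be retained; the names below are hypothetical, not the PR's actual classes.

```
import scala.collection.mutable

// Illustrative only: cap the retained executor events at a fixed limit so
// memory use and timeline rendering cost stay proportional to the cap rather
// than to the full history of executor additions and removals.
case class ExecutorEvent(execId: String, time: Long, removed: Boolean)

class BoundedEventLog(maxEvents: Int) {
  private val events = mutable.Queue.empty[ExecutorEvent]

  def add(event: ExecutorEvent): Unit = {
    events.enqueue(event)
    while (events.size > maxEvents) {
      events.dequeue()   // drop the oldest event once over the cap
    }
  }

  def snapshot: Seq[ExecutorEvent] = events.toSeq
}
```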

Author: cenyuhai 

Closes #14969 from cenyuhai/SPARK-17406.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ad79fc0a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ad79fc0a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ad79fc0a

Branch: refs/heads/master
Commit: ad79fc0a8407a950a03869f2f8cdc3ed0bf13875
Parents: 647ee05
Author: cenyuhai 
Authored: Thu Sep 15 09:58:53 2016 +0100
Committer: Sean Owen 
Committed: Thu Sep 15 09:58:53 2016 +0100

--
 .../apache/spark/ui/exec/ExecutorsPage.scala|  41 +++
 .../org/apache/spark/ui/exec/ExecutorsTab.scala | 112 +++
 .../org/apache/spark/ui/jobs/AllJobsPage.scala  |  66 +--
 .../apache/spark/ui/jobs/ExecutorTable.scala|   3 +-
 .../org/apache/spark/ui/jobs/JobPage.scala  |  67 ++-
 .../org/apache/spark/ui/jobs/StagePage.scala|   4 +-
 .../scala/org/apache/spark/ui/jobs/UIData.scala |   5 -
 project/MimaExcludes.scala  |  12 ++
 8 files changed, 162 insertions(+), 148 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ad79fc0a/core/src/main/scala/org/apache/spark/ui/exec/ExecutorsPage.scala
--
diff --git a/core/src/main/scala/org/apache/spark/ui/exec/ExecutorsPage.scala 
b/core/src/main/scala/org/apache/spark/ui/exec/ExecutorsPage.scala
index 982e891..7953d77 100644
--- a/core/src/main/scala/org/apache/spark/ui/exec/ExecutorsPage.scala
+++ b/core/src/main/scala/org/apache/spark/ui/exec/ExecutorsPage.scala
@@ -17,14 +17,12 @@
 
 package org.apache.spark.ui.exec
 
-import java.net.URLEncoder
 import javax.servlet.http.HttpServletRequest
 
 import scala.xml.Node
 
 import org.apache.spark.status.api.v1.ExecutorSummary
-import org.apache.spark.ui.{ToolTips, UIUtils, WebUIPage}
-import org.apache.spark.util.Utils
+import org.apache.spark.ui.{UIUtils, WebUIPage}
 
 // This isn't even used anymore -- but we need to keep it b/c of a MiMa false 
positive
 private[ui] case class ExecutorSummaryInfo(
@@ -83,18 +81,7 @@ private[spark] object ExecutorsPage {
 val memUsed = status.memUsed
 val maxMem = status.maxMem
 val diskUsed = status.diskUsed
-val totalCores = listener.executorToTotalCores.getOrElse(execId, 0)
-val maxTasks = listener.executorToTasksMax.getOrElse(execId, 0)
-val activeTasks = listener.executorToTasksActive.getOrElse(execId, 0)
-val failedTasks = listener.executorToTasksFailed.getOrElse(execId, 0)
-val completedTasks = listener.executorToTasksComplete.getOrElse(execId, 0)
-val totalTasks = activeTasks + failedTasks + completedTasks
-val totalDuration = listener.executorToDuration.getOrElse(execId, 0L)
-val totalGCTime = listener.executorToJvmGCTime.getOrElse(execId, 0L)
-val totalInputBytes = listener.executorToInputBytes.getOrElse(execId, 0L)
-val totalShuffleRead = listener.executorToShuffleRead.getOrElse(execId, 0L)
-val totalShuffleWrite = listener.executorToShuffleWrite.getOrElse(execId, 
0L)
-val executorLogs = listener.executorToLogUrls.getOrElse(execId, Map.empty)
+val taskSummary = listener.executorToTaskSummary.getOrElse(execId, 
ExecutorTaskSummary(execId))
 
 new ExecutorSummary(
   execId,
@@ -103,19 +90,19 @@ private[spark] object ExecutorsPage {
   rddBlocks,
   memUsed,
   diskUsed,
-  totalCores,
-  maxTasks,
-  activeTasks,
-  failedTasks,
-  completedTasks,
-  totalTasks,
-  totalDuration,
-  totalGCTime,
-  totalInputBytes,
-  totalShuffleRead,
-  totalShuffleWrite,
+  taskSummary.totalCores,
+  taskSummary.tasksMax,
+  taskSummary.tasksActive,
+  taskSummary.tasksFailed,
+  taskSummary.tasksComplete,
+  taskSummary.tasksActive + taskSummary.tasksFailed + 
taskSummary.tasksComplete,
+  taskSummary.duration,
+  taskSummary.jvmGCTime,
+  taskSummary.inputBytes,
+  taskSummary.shuffleRead,
+  taskSummary.shuffleWrite,
   maxMem,
-  executorLogs
+  taskSummary.executorLogs
 )
   }
 }

http://git-wip-

spark git commit: [SPARK-17536][SQL] Minor performance improvement to JDBC batch inserts

2016-09-15 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master ad79fc0a8 -> 71a65825c


[SPARK-17536][SQL] Minor performance improvement to JDBC batch inserts

## What changes were proposed in this pull request?

Optimize the per-row while loop used for JDBC batch inserts by hoisting the
invariant field-count computation (`rddSchema.fields.length`) out of the loop.
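
A minimal standalone sketch of the loop-invariant hoisting applied here; the
row type and helper below are hypothetical, not Spark's classes.

```
// Illustrative only: the schema does not change from row to row, so compute
// the field count once before iterating instead of on every row.
case class Row(values: Array[Any])

def countNonNullCells(rows: Iterator[Row], fieldNames: Seq[String]): Long = {
  val numFields = fieldNames.length   // hoisted out of the per-row loop
  var count = 0L
  while (rows.hasNext) {
    val row = rows.next()
    var i = 0
    while (i < numFields) {
      if (row.values(i) != null) count += 1
      i += 1
    }
  }
  count
}
```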

## How was this patch tested?

Unit tests were done, specifically "mvn  test" for sql

Author: John Muller 

Closes #15098 from blue666man/SPARK-17536.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/71a65825
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/71a65825
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/71a65825

Branch: refs/heads/master
Commit: 71a65825c5d5d0886ac3e11f9945cfcb39573ac3
Parents: ad79fc0
Author: John Muller 
Authored: Thu Sep 15 10:00:28 2016 +0100
Committer: Sean Owen 
Committed: Thu Sep 15 10:00:28 2016 +0100

--
 .../apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/71a65825/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
index 132472a..b09fd51 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
@@ -590,12 +590,12 @@ object JdbcUtils extends Logging {
   val stmt = insertStatement(conn, table, rddSchema, dialect)
   val setters: Array[JDBCValueSetter] = rddSchema.fields.map(_.dataType)
 .map(makeSetter(conn, dialect, _)).toArray
+  val numFields = rddSchema.fields.length
 
   try {
 var rowCount = 0
 while (iterator.hasNext) {
   val row = iterator.next()
-  val numFields = rddSchema.fields.length
   var i = 0
   while (i < numFields) {
 if (row.isNullAt(i)) {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17406][BUILD][HOTFIX] MiMa excludes fix

2016-09-15 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 71a65825c -> 2ad276954


[SPARK-17406][BUILD][HOTFIX] MiMa excludes fix

## What changes were proposed in this pull request?

Following https://github.com/apache/spark/pull/14969 for some reason the MiMa 
excludes weren't complete, but still passed the PR builder. This adds 3 more 
excludes from 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.2/1749/consoleFull

It also moves the excludes to their own Seq in the build, as they probably 
should have been.
Even though this is merged to 2.1.x only / master, I left the exclude in for 
2.0.x in case we back port. It's a private API so is always a false positive.

## How was this patch tested?

Jenkins build

Author: Sean Owen 

Closes #15110 from srowen/SPARK-17406.2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2ad27695
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2ad27695
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2ad27695

Branch: refs/heads/master
Commit: 2ad276954858b0a7b3f442b9e440c72cbb1610e2
Parents: 71a6582
Author: Sean Owen 
Authored: Thu Sep 15 13:54:41 2016 +0100
Committer: Sean Owen 
Committed: Thu Sep 15 13:54:41 2016 +0100

--
 project/MimaExcludes.scala | 29 +
 1 file changed, 17 insertions(+), 12 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2ad27695/project/MimaExcludes.scala
--
diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index 37fff2e..1bdcf9a 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -426,18 +426,6 @@ object MimaExcludes {
   
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.StorageStatusListener.this"),
   
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.streaming.scheduler.BatchInfo.streamIdToNumRecords"),
   
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.storageStatusList"),
-  
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorIdToData"),
-  
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorToTasksActive"),
-  
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorToTasksComplete"),
-  
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorToInputRecords"),
-  
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorToShuffleRead"),
-  
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorToTasksFailed"),
-  
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorToShuffleWrite"),
-  
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorToDuration"),
-  
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorToInputBytes"),
-  
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorToLogUrls"),
-  
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorToOutputBytes"),
-  
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorToOutputRecords"),
   
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.this"),
   
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.storage.StorageListener.storageStatusList"),
   
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ExceptionFailure.apply"),
@@ -807,6 +795,23 @@ object MimaExcludes {
   // SPARK-17096: Improve exception string reported through the 
StreamingQueryListener
   
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.streaming.StreamingQueryListener#QueryTerminated.stackTrace"),
   
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.streaming.StreamingQueryListener#QueryTerminated.this")
+) ++ Seq(
+  // SPARK-17406 limit timeline executor events
+  
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.Execu

spark git commit: [SPARK-17543] Missing log4j config file for tests in common/network-…

2016-09-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master b72486f82 -> b2e272624


[SPARK-17543] Missing log4j config file for tests in common/network-…

## What changes were proposed in this pull request?

The Maven module `common/network-shuffle` does not have a log4j configuration
file for its test cases, so this adds a `log4j.properties` file under
`src/test/resources`.

…shuffle]

Author: Jagadeesan 

Closes #15108 from jagadeesanas2/SPARK-17543.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b2e27262
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b2e27262
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b2e27262

Branch: refs/heads/master
Commit: b2e27262440015f57bcfa888921c9cc017800910
Parents: b72486f
Author: Jagadeesan 
Authored: Fri Sep 16 10:18:45 2016 +0100
Committer: Sean Owen 
Committed: Fri Sep 16 10:18:45 2016 +0100

--
 .../src/test/resources/log4j.properties | 24 
 1 file changed, 24 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b2e27262/common/network-shuffle/src/test/resources/log4j.properties
--
diff --git a/common/network-shuffle/src/test/resources/log4j.properties 
b/common/network-shuffle/src/test/resources/log4j.properties
new file mode 100644
index 000..e739789
--- /dev/null
+++ b/common/network-shuffle/src/test/resources/log4j.properties
@@ -0,0 +1,24 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Set everything to be logged to the file target/unit-tests.log
+log4j.rootCategory=DEBUG, file
+log4j.appender.file=org.apache.log4j.FileAppender
+log4j.appender.file.append=true
+log4j.appender.file.file=target/unit-tests.log
+log4j.appender.file.layout=org.apache.log4j.PatternLayout
+log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p 
%c{1}: %m%n


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17534][TESTS] Increase timeouts for DirectKafkaStreamSuite tests

2016-09-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master b2e272624 -> fc1efb720


[SPARK-17534][TESTS] Increase timeouts for DirectKafkaStreamSuite tests

## What changes were proposed in this pull request?
There are two tests in this suite that are particularly flaky on the following 
hardware:

2x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz, 16 GB of RAM, 1 TB HDD

This simple PR increases the timeouts and the batch duration so that these tests
can pass reliably.
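
For context, the knob being tuned is ScalaTest's `eventually`, which retries an assertion at a polling interval until a timeout expires. A minimal standalone sketch of the pattern follows; the values and the condition are illustrative, not the ones in `DirectKafkaStreamSuite`.

```scala
// Minimal sketch of the ScalaTest pattern the suite relies on: retry an
// assertion every `interval` until it passes or `timeout` expires.
import org.scalatest.Assertions._
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

object EventuallySketch {
  def main(args: Array[String]): Unit = {
    val readyAt = System.currentTimeMillis() + 2000
    // Generous timeout and interval make the check robust on slow or
    // heavily loaded test machines (illustrative values).
    eventually(timeout(10.seconds), interval(100.milliseconds)) {
      assert(System.currentTimeMillis() >= readyAt, "not ready yet")
    }
    println("condition eventually held")
  }
}
```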

## How was this patch tested?
Existing unit tests, run on the two-core box where I was often seeing the failures.

Author: Adam Roberts 

Closes #15094 from a-roberts/patch-6.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fc1efb72
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fc1efb72
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fc1efb72

Branch: refs/heads/master
Commit: fc1efb720c9c0033077c3c20ee144d0f757e6bcd
Parents: b2e2726
Author: Adam Roberts 
Authored: Fri Sep 16 10:20:50 2016 +0100
Committer: Sean Owen 
Committed: Fri Sep 16 10:20:50 2016 +0100

--
 .../spark/streaming/kafka010/DirectKafkaStreamSuite.scala| 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fc1efb72/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
--
diff --git 
a/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
 
b/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
index b1d90b8..e04f35e 100644
--- 
a/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
+++ 
b/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
@@ -108,7 +108,7 @@ class DirectKafkaStreamSuite
 val expectedTotal = (data.values.sum * topics.size) - 2
 val kafkaParams = getKafkaParams("auto.offset.reset" -> "earliest")
 
-ssc = new StreamingContext(sparkConf, Milliseconds(200))
+ssc = new StreamingContext(sparkConf, Milliseconds(1000))
 val stream = withClue("Error creating direct stream") {
   KafkaUtils.createDirectStream[String, String](
 ssc,
@@ -150,7 +150,7 @@ class DirectKafkaStreamSuite
   allReceived.addAll(Arrays.asList(rdd.map(r => (r.key, 
r.value)).collect(): _*))
 }
 ssc.start()
-eventually(timeout(2.milliseconds), interval(200.milliseconds)) {
+eventually(timeout(10.milliseconds), interval(1000.milliseconds)) {
   assert(allReceived.size === expectedTotal,
 "didn't get expected number of messages, messages:\n" +
   allReceived.asScala.mkString("\n"))
@@ -172,7 +172,7 @@ class DirectKafkaStreamSuite
 val expectedTotal = (data.values.sum * 2) - 3
 val kafkaParams = getKafkaParams("auto.offset.reset" -> "earliest")
 
-ssc = new StreamingContext(sparkConf, Milliseconds(200))
+ssc = new StreamingContext(sparkConf, Milliseconds(1000))
 val stream = withClue("Error creating direct stream") {
   KafkaUtils.createDirectStream[String, String](
 ssc,
@@ -214,7 +214,7 @@ class DirectKafkaStreamSuite
   allReceived.addAll(Arrays.asList(rdd.map(r => (r.key, 
r.value)).collect(): _*))
 }
 ssc.start()
-eventually(timeout(2.milliseconds), interval(200.milliseconds)) {
+eventually(timeout(10.milliseconds), interval(1000.milliseconds)) {
   assert(allReceived.size === expectedTotal,
 "didn't get expected number of messages, messages:\n" +
   allReceived.asScala.mkString("\n"))


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: Correct fetchsize property name in docs

2016-09-17 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 39e2bad6a -> 69cb04969


Correct fetchsize property name in docs

## What changes were proposed in this pull request?

Replace `fetchSize` with `fetchsize` in the docs.

## How was this patch tested?

I manually tested `fetchSize` and `fetchsize`. The latter has an effect. See 
also 
[`JdbcUtils.scala#L38`](https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L38)
 for the definition of the property.
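
For illustration, a minimal sketch of where the lowercase option name goes on a JDBC read; the URL, table, and value are placeholders, not anything from this patch.

```scala
// Sketch only: the point is the lowercase "fetchsize" key. The connection
// details below are placeholders and assume a JDBC driver on the classpath.
import org.apache.spark.sql.SparkSession

object FetchSizeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fetchsize-example")
      .master("local[*]")
      .getOrCreate()

    val people = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")
      .option("dbtable", "public.people")
      .option("fetchsize", "100")  // lowercase key; per this fix, the camelCase name is not the one read
      .load()

    people.show()
    spark.stop()
  }
}
```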

Author: Daniel Darabos 

Closes #14975 from darabos/patch-3.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/69cb0496
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/69cb0496
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/69cb0496

Branch: refs/heads/master
Commit: 69cb0496974737347e2650cda436b39bbd51e581
Parents: 39e2bad
Author: Daniel Darabos 
Authored: Sat Sep 17 12:28:42 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 17 12:28:42 2016 +0100

--
 docs/sql-programming-guide.md  | 2 +-
 .../src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala   | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/69cb0496/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 28cc88c..4ac5fae 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1053,7 +1053,7 @@ the Data Sources API. The following options are supported:
   
 
   
-fetchSize
+fetchsize
 
   The JDBC fetch size, which determines how many rows to fetch per round 
trip. This can help performance on JDBC drivers which default to low fetch size 
(eg. Oracle with 10 rows).
 

http://git-wip-us.apache.org/repos/asf/spark/blob/69cb0496/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
index 2d8ee33..10f15ca 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
@@ -289,7 +289,7 @@ class JDBCSuite extends SparkFunSuite
 assert(names(2).equals("mary"))
   }
 
-  test("SELECT first field when fetchSize is two") {
+  test("SELECT first field when fetchsize is two") {
 val names = sql("SELECT NAME FROM fetchtwo").collect().map(x => 
x.getString(0)).sortWith(_ < _)
 assert(names.size === 3)
 assert(names(0).equals("fred"))
@@ -305,7 +305,7 @@ class JDBCSuite extends SparkFunSuite
 assert(ids(2) === 3)
   }
 
-  test("SELECT second field when fetchSize is two") {
+  test("SELECT second field when fetchsize is two") {
 val ids = sql("SELECT THEID FROM fetchtwo").collect().map(x => 
x.getInt(0)).sortWith(_ < _)
 assert(ids.size === 3)
 assert(ids(0) === 1)
@@ -352,7 +352,7 @@ class JDBCSuite extends SparkFunSuite
   urlWithUserAndPass, "TEST.PEOPLE", new Properties()).collect().length 
=== 3)
   }
 
-  test("Basic API with illegal FetchSize") {
+  test("Basic API with illegal fetchsize") {
 val properties = new Properties()
 properties.setProperty(JdbcUtils.JDBC_BATCH_FETCH_SIZE, "-1")
 val e = intercept[SparkException] {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: Correct fetchsize property name in docs

2016-09-17 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 3fce1255a -> 9ff158b81


Correct fetchsize property name in docs

## What changes were proposed in this pull request?

Replace `fetchSize` with `fetchsize` in the docs.

## How was this patch tested?

I manually tested `fetchSize` and `fetchsize`. The latter has an effect. See 
also 
[`JdbcUtils.scala#L38`](https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L38)
 for the definition of the property.

Author: Daniel Darabos 

Closes #14975 from darabos/patch-3.

(cherry picked from commit 69cb0496974737347e2650cda436b39bbd51e581)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9ff158b8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9ff158b8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9ff158b8

Branch: refs/heads/branch-2.0
Commit: 9ff158b81224c106d50e087c0d284b0c86c95879
Parents: 3fce125
Author: Daniel Darabos 
Authored: Sat Sep 17 12:28:42 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 17 12:29:01 2016 +0100

--
 docs/sql-programming-guide.md  | 2 +-
 .../src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala   | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9ff158b8/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 3b01dc8..0bd0093 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1041,7 +1041,7 @@ the Data Sources API. The following options are supported:
   
 
   
-fetchSize
+fetchsize
 
   The JDBC fetch size, which determines how many rows to fetch per round 
trip. This can help performance on JDBC drivers which default to low fetch size 
(eg. Oracle with 10 rows).
 

http://git-wip-us.apache.org/repos/asf/spark/blob/9ff158b8/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
index 995b120..ec419e4 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
@@ -289,7 +289,7 @@ class JDBCSuite extends SparkFunSuite
 assert(names(2).equals("mary"))
   }
 
-  test("SELECT first field when fetchSize is two") {
+  test("SELECT first field when fetchsize is two") {
 val names = sql("SELECT NAME FROM fetchtwo").collect().map(x => 
x.getString(0)).sortWith(_ < _)
 assert(names.size === 3)
 assert(names(0).equals("fred"))
@@ -305,7 +305,7 @@ class JDBCSuite extends SparkFunSuite
 assert(ids(2) === 3)
   }
 
-  test("SELECT second field when fetchSize is two") {
+  test("SELECT second field when fetchsize is two") {
 val ids = sql("SELECT THEID FROM fetchtwo").collect().map(x => 
x.getInt(0)).sortWith(_ < _)
 assert(ids.size === 3)
 assert(ids(0) === 1)
@@ -352,7 +352,7 @@ class JDBCSuite extends SparkFunSuite
   urlWithUserAndPass, "TEST.PEOPLE", new Properties()).collect().length 
=== 3)
   }
 
-  test("Basic API with illegal FetchSize") {
+  test("Basic API with illegal fetchsize") {
 val properties = new Properties()
 properties.setProperty(JdbcUtils.JDBC_BATCH_FETCH_SIZE, "-1")
 val e = intercept[SparkException] {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17567][DOCS] Use valid url to Spark RDD paper

2016-09-17 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 9ff158b81 -> 3ca0dc007


[SPARK-17567][DOCS] Use valid url to Spark RDD paper

https://issues.apache.org/jira/browse/SPARK-17567

## What changes were proposed in this pull request?

The documentation
(http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD)
contains a broken link to the Spark paper
(http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf).

I found it elsewhere
(https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf), and
I hope it is the same one. It should be uploaded to, and linked from, some
Apache-controlled storage so it won't break again.

## How was this patch tested?

Tested manually on local laptop.

Author: Xin Ren 

Closes #15121 from keypointt/SPARK-17567.

(cherry picked from commit f15d41be3ce7569736ccbf2ffe1bec265865f55d)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3ca0dc00
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3ca0dc00
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3ca0dc00

Branch: refs/heads/branch-2.0
Commit: 3ca0dc00786df1d529d55e297aaf23e1e1e07999
Parents: 9ff158b
Author: Xin Ren 
Authored: Sat Sep 17 12:30:25 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 17 12:30:36 2016 +0100

--
 core/src/main/scala/org/apache/spark/rdd/RDD.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3ca0dc00/core/src/main/scala/org/apache/spark/rdd/RDD.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala 
b/core/src/main/scala/org/apache/spark/rdd/RDD.scala
index 2ee13dc..34d32aa 100644
--- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala
@@ -70,7 +70,7 @@ import org.apache.spark.util.random.{BernoulliCellSampler, 
BernoulliSampler, Poi
  * All of the scheduling and execution in Spark is done based on these 
methods, allowing each RDD
  * to implement its own way of computing itself. Indeed, users can implement 
custom RDDs (e.g. for
  * reading data from a new storage system) by overriding these functions. 
Please refer to the
- * [[http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf Spark 
paper]] for more details
+ * [[http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf Spark 
paper]] for more details
  * on RDD internals.
  */
 abstract class RDD[T: ClassTag](


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17567][DOCS] Use valid url to Spark RDD paper

2016-09-17 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 69cb04969 -> f15d41be3


[SPARK-17567][DOCS] Use valid url to Spark RDD paper

https://issues.apache.org/jira/browse/SPARK-17567

## What changes were proposed in this pull request?

The documentation
(http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD)
contains a broken link to the Spark paper
(http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf).

I found it elsewhere
(https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf), and
I hope it is the same one. It should be uploaded to, and linked from, some
Apache-controlled storage so it won't break again.

## How was this patch tested?

Tested manually on local laptop.

Author: Xin Ren 

Closes #15121 from keypointt/SPARK-17567.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f15d41be
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f15d41be
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f15d41be

Branch: refs/heads/master
Commit: f15d41be3ce7569736ccbf2ffe1bec265865f55d
Parents: 69cb049
Author: Xin Ren 
Authored: Sat Sep 17 12:30:25 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 17 12:30:25 2016 +0100

--
 core/src/main/scala/org/apache/spark/rdd/RDD.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f15d41be/core/src/main/scala/org/apache/spark/rdd/RDD.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala 
b/core/src/main/scala/org/apache/spark/rdd/RDD.scala
index 10b5f82..6dc334c 100644
--- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala
@@ -70,7 +70,7 @@ import org.apache.spark.util.random.{BernoulliCellSampler, 
BernoulliSampler, Poi
  * All of the scheduling and execution in Spark is done based on these 
methods, allowing each RDD
  * to implement its own way of computing itself. Indeed, users can implement 
custom RDDs (e.g. for
  * reading data from a new storage system) by overriding these functions. 
Please refer to the
- * [[http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf Spark 
paper]] for more details
+ * [[http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf Spark 
paper]] for more details
  * on RDD internals.
  */
 abstract class RDD[T: ClassTag](


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17561][DOCS] DataFrameWriter documentation formatting problems

2016-09-17 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 3ca0dc007 -> c9bd67e94


[SPARK-17561][DOCS] DataFrameWriter documentation formatting problems

Fix `<ul>` / `<li>` problems in SQL scaladoc.

Scaladoc build and manual verification of generated HTML.
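
As a purely illustrative sketch of the kind of markup involved (the class is hypothetical and the wording paraphrases the option descriptions in the diff), an HTML list inside a Scaladoc comment looks roughly like this:

```scala
// Illustrative only: a well-formed <ul>/<li> list inside a Scaladoc comment.
// The class name and wording are placeholders paraphrasing the real docs.
/**
 * `mode` (default `PERMISSIVE`): how to deal with corrupt records during parsing.
 * <ul>
 *   <li>`PERMISSIVE` : set other fields to `null` for a corrupted record.</li>
 *   <li>`DROPMALFORMED` : ignore the whole corrupted record.</li>
 *   <li>`FAILFAST` : throw an exception on a corrupted record.</li>
 * </ul>
 */
class CorruptRecordModesDoc
```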

Author: Sean Owen 

Closes #15117 from srowen/SPARK-17561.

(cherry picked from commit b9323fc9381a09af510f542fd5c86473e029caf6)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c9bd67e9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c9bd67e9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c9bd67e9

Branch: refs/heads/branch-2.0
Commit: c9bd67e94d9d9d2e1f2cb1e5c4bb71a69b1e1d4e
Parents: 3ca0dc0
Author: Sean Owen 
Authored: Fri Sep 16 13:43:05 2016 -0700
Committer: Sean Owen 
Committed: Sat Sep 17 12:43:30 2016 +0100

--
 .../org/apache/spark/sql/DataFrameReader.scala  | 32 +
 .../org/apache/spark/sql/DataFrameWriter.scala  | 10 ++
 .../spark/sql/streaming/DataStreamReader.scala  | 38 
 3 files changed, 51 insertions(+), 29 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c9bd67e9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index 083c2e2..410cb20 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -269,14 +269,15 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* `allowBackslashEscapingAnyCharacter` (default `false`): allows 
accepting quoting of all
* character using backslash quoting mechanism
* `mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt 
records
-   * during parsing.
-   * 
-   *   - `PERMISSIVE` : sets other fields to `null` when it meets a 
corrupted record, and puts
-   *  the malformed string into a new field configured by 
`columnNameOfCorruptRecord`. When
-   *  a schema is set by user, it sets `null` for extra fields.
-   *   - `DROPMALFORMED` : ignores the whole corrupted records.
-   *   - `FAILFAST` : throws an exception when it meets corrupted 
records.
-   * 
+   * during parsing.
+   *   
+   * `PERMISSIVE` : sets other fields to `null` when it meets a 
corrupted record, and puts
+   * the malformed string into a new field configured by 
`columnNameOfCorruptRecord`. When
+   * a schema is set by user, it sets `null` for extra fields.
+   * `DROPMALFORMED` : ignores the whole corrupted records.
+   * `FAILFAST` : throws an exception when it meets corrupted 
records.
+   *   
+   * 
* `columnNameOfCorruptRecord` (default is the value specified in
* `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field 
having malformed string
* created by `PERMISSIVE` mode. This overrides 
`spark.sql.columnNameOfCorruptRecord`.
@@ -396,13 +397,14 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* `maxMalformedLogPerPartition` (default `10`): sets the maximum number 
of malformed rows
* Spark will log for each partition. Malformed records beyond this number 
will be ignored.
* `mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt 
records
-   *during parsing.
-   * 
-   *- `PERMISSIVE` : sets other fields to `null` when it meets a 
corrupted record. When
-   * a schema is set by user, it sets `null` for extra fields.
-   *- `DROPMALFORMED` : ignores the whole corrupted records.
-   *- `FAILFAST` : throws an exception when it meets corrupted 
records.
-   * 
+   *during parsing.
+   *   
+   * `PERMISSIVE` : sets other fields to `null` when it meets a 
corrupted record. When
+   *   a schema is set by user, it sets `null` for extra fields.
+   * `DROPMALFORMED` : ignores the whole corrupted records.
+   * `FAILFAST` : throws an exception when it meets corrupted 
records.
+   *   
+   * 
* 
* @since 2.0.0
*/

http://git-wip-us.apache.org/repos/asf/spark/blob/c9bd67e9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
index 767af99..a4c4a5d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -449,6 +449,7 @@ final class DataFrameWriter[T] private[

spark git commit: [SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously rejects the best match when invoked with a vector

2016-09-17 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master f15d41be3 -> 25cbbe6ca


[SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously rejects 
the best match when invoked with a vector

## What changes were proposed in this pull request?

This pull request changes the behavior of `Word2VecModel.findSynonyms` so that 
it will not spuriously reject the best match when invoked with a vector that 
does not correspond to a word in the model's vocabulary.  Instead of blindly 
discarding the best match, the changed implementation discards a match that 
corresponds to the query word (in cases where `findSynonyms` is invoked with a 
word) or that has an identical angle to the query vector.
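
A small standalone sketch of that selection rule (not the MLlib implementation; the names and the tolerance are illustrative): keep the top match unless it is literally the query word, or, for a vector query, unless its angle to the query is identical (cosine similarity of 1).

```scala
// Standalone sketch of the selection rule described above, not the actual
// MLlib code: candidates are (word, cosine similarity) pairs already sorted
// by decreasing similarity.
object FindSynonymsSketch {
  def pickSynonyms(
      candidates: Seq[(String, Double)],
      num: Int,
      queryWord: Option[String]): Seq[(String, Double)] = {
    candidates.filter { case (word, similarity) =>
      queryWord match {
        case Some(w) => word != w                  // word query: drop only the query word itself
        case None    => similarity < 1.0 - 1e-10   // vector query: drop only an identical-angle match
      }
    }.take(num)
  }

  def main(args: Array[String]): Unit = {
    val scored = Seq(("queen", 0.92), ("monarch", 0.87), ("princess", 0.81))
    // Vector query: the best match ("queen") is kept.
    println(pickSynonyms(scored, 2, queryWord = None))
    // Word query for "queen": the word is excluded from its own synonyms.
    println(pickSynonyms(scored, 2, queryWord = Some("queen")))
  }
}
```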

## How was this patch tested?

I added a test to `Word2VecSuite` to ensure that the word whose vector is most
similar to a supplied vector is not spuriously rejected.

Author: William Benton 

Closes #15105 from willb/fix/findSynonyms.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/25cbbe6c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/25cbbe6c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/25cbbe6c

Branch: refs/heads/master
Commit: 25cbbe6ca334140204e7035ab8b9d304da9b8a8a
Parents: f15d41b
Author: William Benton 
Authored: Sat Sep 17 12:49:58 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 17 12:49:58 2016 +0100

--
 .../org/apache/spark/ml/feature/Word2Vec.scala  | 20 ++-
 .../mllib/api/python/Word2VecModelWrapper.scala | 22 ++--
 .../apache/spark/mllib/feature/Word2Vec.scala   | 37 +++-
 .../spark/mllib/feature/Word2VecSuite.scala | 16 +
 python/pyspark/mllib/feature.py | 12 +--
 5 files changed, 83 insertions(+), 24 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/25cbbe6c/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
index c2b434c..14c0512 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
@@ -221,24 +221,26 @@ class Word2VecModel private[ml] (
   }
 
   /**
-   * Find "num" number of words closest in similarity to the given word.
-   * Returns a dataframe with the words and the cosine similarities between the
-   * synonyms and the given word.
+   * Find "num" number of words closest in similarity to the given word, not
+   * including the word itself. Returns a dataframe with the words and the
+   * cosine similarities between the synonyms and the given word.
*/
   @Since("1.5.0")
   def findSynonyms(word: String, num: Int): DataFrame = {
-findSynonyms(wordVectors.transform(word), num)
+val spark = SparkSession.builder().getOrCreate()
+spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", 
"similarity")
   }
 
   /**
-   * Find "num" number of words closest to similarity to the given vector 
representation
-   * of the word. Returns a dataframe with the words and the cosine 
similarities between the
-   * synonyms and the given word vector.
+   * Find "num" number of words whose vector representation most similar to 
the supplied vector.
+   * If the supplied vector is the vector representation of a word in the 
model's vocabulary,
+   * that word will be in the results.  Returns a dataframe with the words and 
the cosine
+   * similarities between the synonyms and the given word vector.
*/
   @Since("2.0.0")
-  def findSynonyms(word: Vector, num: Int): DataFrame = {
+  def findSynonyms(vec: Vector, num: Int): DataFrame = {
 val spark = SparkSession.builder().getOrCreate()
-spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", 
"similarity")
+spark.createDataFrame(wordVectors.findSynonyms(vec, num)).toDF("word", 
"similarity")
   }
 
   /** @group setParam */

http://git-wip-us.apache.org/repos/asf/spark/blob/25cbbe6c/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala
 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala
index 4b4ed22..5cbfbff 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala
@@ -43,18 +43,34 @@ private[python] class Word2VecModelWrapper(model: 
Word2VecModel) {
 rdd.rdd.map(model.transform)
   }
 
+  /**
+   * Finds sy

spark git commit: [SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously rejects the best match when invoked with a vector

2016-09-17 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 c9bd67e94 -> eb2675de9


[SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously rejects 
the best match when invoked with a vector

## What changes were proposed in this pull request?

This pull request changes the behavior of `Word2VecModel.findSynonyms` so that 
it will not spuriously reject the best match when invoked with a vector that 
does not correspond to a word in the model's vocabulary.  Instead of blindly 
discarding the best match, the changed implementation discards a match that 
corresponds to the query word (in cases where `findSynonyms` is invoked with a 
word) or that has an identical angle to the query vector.

## How was this patch tested?

I added a test to `Word2VecSuite` to ensure that the word whose vector is most
similar to a supplied vector is not spuriously rejected.

Author: William Benton 

Closes #15105 from willb/fix/findSynonyms.

(cherry picked from commit 25cbbe6ca334140204e7035ab8b9d304da9b8a8a)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eb2675de
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eb2675de
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eb2675de

Branch: refs/heads/branch-2.0
Commit: eb2675de92b865852d7aa3ef25a20e6cff940299
Parents: c9bd67e
Author: William Benton 
Authored: Sat Sep 17 12:49:58 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 17 12:50:09 2016 +0100

--
 .../org/apache/spark/ml/feature/Word2Vec.scala  | 20 ++-
 .../mllib/api/python/Word2VecModelWrapper.scala | 22 ++--
 .../apache/spark/mllib/feature/Word2Vec.scala   | 37 +++-
 .../spark/mllib/feature/Word2VecSuite.scala | 16 +
 python/pyspark/mllib/feature.py | 12 +--
 5 files changed, 83 insertions(+), 24 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/eb2675de/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
index c2b434c..14c0512 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
@@ -221,24 +221,26 @@ class Word2VecModel private[ml] (
   }
 
   /**
-   * Find "num" number of words closest in similarity to the given word.
-   * Returns a dataframe with the words and the cosine similarities between the
-   * synonyms and the given word.
+   * Find "num" number of words closest in similarity to the given word, not
+   * including the word itself. Returns a dataframe with the words and the
+   * cosine similarities between the synonyms and the given word.
*/
   @Since("1.5.0")
   def findSynonyms(word: String, num: Int): DataFrame = {
-findSynonyms(wordVectors.transform(word), num)
+val spark = SparkSession.builder().getOrCreate()
+spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", 
"similarity")
   }
 
   /**
-   * Find "num" number of words closest to similarity to the given vector 
representation
-   * of the word. Returns a dataframe with the words and the cosine 
similarities between the
-   * synonyms and the given word vector.
+   * Find "num" number of words whose vector representation most similar to 
the supplied vector.
+   * If the supplied vector is the vector representation of a word in the 
model's vocabulary,
+   * that word will be in the results.  Returns a dataframe with the words and 
the cosine
+   * similarities between the synonyms and the given word vector.
*/
   @Since("2.0.0")
-  def findSynonyms(word: Vector, num: Int): DataFrame = {
+  def findSynonyms(vec: Vector, num: Int): DataFrame = {
 val spark = SparkSession.builder().getOrCreate()
-spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", 
"similarity")
+spark.createDataFrame(wordVectors.findSynonyms(vec, num)).toDF("word", 
"similarity")
   }
 
   /** @group setParam */

http://git-wip-us.apache.org/repos/asf/spark/blob/eb2675de/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala
 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala
index 4b4ed22..5cbfbff 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala
@@ -43,18 +43,34 @@ private[python] class Word

spark git commit: [SPARK-17529][CORE] Implement BitSet.clearUntil and use it during merge joins

2016-09-17 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 25cbbe6ca -> 9dbd4b864


[SPARK-17529][CORE] Implement BitSet.clearUntil and use it during merge joins

## What changes were proposed in this pull request?

Add a clearUntil() method to BitSet (adapted from the pre-existing setUntil()
method).
Use this method to clear only the subset of the BitSet that needs to be reused
during merge joins.
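
As a rough analogy for the intended semantics (this sketch uses `java.util.BitSet`, not Spark's internal `org.apache.spark.util.collection.BitSet`): only the bits below the given index are wiped, so a bit set reused across join groups does not need a full clear.

```scala
// Analogy only, using java.util.BitSet rather than Spark's internal BitSet:
// clear just the prefix of bits that was actually used, leave higher bits alone.
import java.util.{BitSet => JBitSet}

object ClearUntilSketch {
  def main(args: Array[String]): Unit = {
    val bits = new JBitSet(128)
    bits.set(3)
    bits.set(70)
    // Roughly what clearUntil(64) is for: wipe bits in [0, 64) only.
    bits.clear(0, 64)
    println(bits.get(3))   // false: below the boundary, cleared
    println(bits.get(70))  // true: above the boundary, untouched
  }
}
```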

## How was this patch tested?

dev/run-tests, as well as performance tests on skewed data as described in the JIRA.

I expect a small local performance hit from using BitSet.clearUntil rather than
BitSet.clear for normally shaped (unskewed) joins (an additional read of the last
long). This is expected to be de minimis and was not specifically tested.

Author: David Navas 

Closes #15084 from davidnavas/bitSet.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9dbd4b86
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9dbd4b86
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9dbd4b86

Branch: refs/heads/master
Commit: 9dbd4b864efacd09a8353d00c998be87f9eeacb2
Parents: 25cbbe6
Author: David Navas 
Authored: Sat Sep 17 16:22:23 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 17 16:22:23 2016 +0100

--
 .../apache/spark/util/collection/BitSet.scala   | 28 +++--
 .../spark/util/collection/BitSetSuite.scala | 32 
 .../sql/execution/joins/SortMergeJoinExec.scala |  4 +--
 3 files changed, 52 insertions(+), 12 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9dbd4b86/core/src/main/scala/org/apache/spark/util/collection/BitSet.scala
--
diff --git a/core/src/main/scala/org/apache/spark/util/collection/BitSet.scala 
b/core/src/main/scala/org/apache/spark/util/collection/BitSet.scala
index 7ab67fc..e63e0e3 100644
--- a/core/src/main/scala/org/apache/spark/util/collection/BitSet.scala
+++ b/core/src/main/scala/org/apache/spark/util/collection/BitSet.scala
@@ -17,6 +17,8 @@
 
 package org.apache.spark.util.collection
 
+import java.util.Arrays
+
 /**
  * A simple, fixed-size bit set implementation. This implementation is fast 
because it avoids
  * safety/bound checking.
@@ -35,21 +37,14 @@ class BitSet(numBits: Int) extends Serializable {
   /**
* Clear all set bits.
*/
-  def clear(): Unit = {
-var i = 0
-while (i < numWords) {
-  words(i) = 0L
-  i += 1
-}
-  }
+  def clear(): Unit = Arrays.fill(words, 0)
 
   /**
* Set all the bits up to a given index
*/
-  def setUntil(bitIndex: Int) {
+  def setUntil(bitIndex: Int): Unit = {
 val wordIndex = bitIndex >> 6 // divide by 64
-var i = 0
-while(i < wordIndex) { words(i) = -1; i += 1 }
+Arrays.fill(words, 0, wordIndex, -1)
 if(wordIndex < words.length) {
   // Set the remaining bits (note that the mask could still be zero)
   val mask = ~(-1L << (bitIndex & 0x3f))
@@ -58,6 +53,19 @@ class BitSet(numBits: Int) extends Serializable {
   }
 
   /**
+   * Clear all the bits up to a given index
+   */
+  def clearUntil(bitIndex: Int): Unit = {
+val wordIndex = bitIndex >> 6 // divide by 64
+Arrays.fill(words, 0, wordIndex, 0)
+if(wordIndex < words.length) {
+  // Clear the remaining bits
+  val mask = -1L << (bitIndex & 0x3f)
+  words(wordIndex) &= mask
+}
+  }
+
+  /**
* Compute the bit-wise AND of the two sets returning the
* result.
*/

http://git-wip-us.apache.org/repos/asf/spark/blob/9dbd4b86/core/src/test/scala/org/apache/spark/util/collection/BitSetSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/util/collection/BitSetSuite.scala 
b/core/src/test/scala/org/apache/spark/util/collection/BitSetSuite.scala
index 69dbfa9..0169c99 100644
--- a/core/src/test/scala/org/apache/spark/util/collection/BitSetSuite.scala
+++ b/core/src/test/scala/org/apache/spark/util/collection/BitSetSuite.scala
@@ -152,4 +152,36 @@ class BitSetSuite extends SparkFunSuite {
 assert(bitsetDiff.nextSetBit(85) === 85)
 assert(bitsetDiff.nextSetBit(86) === -1)
   }
+
+  test( "[gs]etUntil" ) {
+val bitSet = new BitSet(100)
+
+bitSet.setUntil(bitSet.capacity)
+
+(0 until bitSet.capacity).foreach { i =>
+  assert(bitSet.get(i))
+}
+
+bitSet.clearUntil(bitSet.capacity)
+
+(0 until bitSet.capacity).foreach { i =>
+  assert(!bitSet.get(i))
+}
+
+val setUntil = bitSet.capacity / 2
+bitSet.setUntil(setUntil)
+
+val clearUntil = setUntil / 2
+bitSet.clearUntil(clearUntil)
+
+(0 until clearUntil).foreach { i =>
+  assert(!bitSet.get(i))
+}
+(clearUntil until setUntil).foreach { i =>
+  assert(bitSet.get(i))
+}
+

spark git commit: [SPARK-17575][DOCS] Remove extra table tags in configuration document

2016-09-17 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 eb2675de9 -> ec2b73656


[SPARK-17575][DOCS] Remove extra table tags in configuration document

## What changes were proposed in this pull request?

Remove the extra table tags in the configuration document.

## How was this patch tested?

Ran all test cases and generated the document.

Before, with the extra tags, it looked like this:
![config-wrong1](https://cloud.githubusercontent.com/assets/8075390/18608239/c602bb60-7d01-11e6-875e-f38558997dd3.png)

![config-wrong2](https://cloud.githubusercontent.com/assets/8075390/18608241/cf3b672c-7d01-11e6-935e-1e73f9e6e578.png)

After removing the tags, it looks like this:

![config](https://cloud.githubusercontent.com/assets/8075390/18608245/e156eb8e-7d01-11e6-98aa-3be68d4d1961.png)

![config2](https://cloud.githubusercontent.com/assets/8075390/18608247/e84eecd4-7d01-11e6-9738-a3f7ff8fe834.png)

Author: sandy 

Closes #15130 from phalodi/SPARK-17575.

(cherry picked from commit bbe0b1d623741decce98827130cc67eb1fff1240)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ec2b7365
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ec2b7365
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ec2b7365

Branch: refs/heads/branch-2.0
Commit: ec2b736566b69a1549791f3d86b55cb0249a757d
Parents: eb2675d
Author: sandy 
Authored: Sat Sep 17 16:25:03 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 17 16:25:14 2016 +0100

--
 docs/configuration.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ec2b7365/docs/configuration.md
--
diff --git a/docs/configuration.md b/docs/configuration.md
index d37da02..db088dd 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -123,6 +123,7 @@ of the most common options to set are:
 Number of cores to use for the driver process, only in cluster mode.
   
 
+
   spark.driver.maxResultSize
   1g
   
@@ -217,7 +218,7 @@ Apart from these, the following properties are also 
available, and may be useful
 Note: In client mode, this config must not be set through 
the SparkConf
 directly in your application, because the driver JVM has already started 
at that point.
 Instead, please set this through the --driver-class-path 
command line option or in
-your default properties file.
+your default properties file.
   
 
 
@@ -244,7 +245,7 @@ Apart from these, the following properties are also 
available, and may be useful
 Note: In client mode, this config must not be set through 
the SparkConf
 directly in your application, because the driver JVM has already started 
at that point.
 Instead, please set this through the --driver-library-path 
command line option or in
-your default properties file.
+your default properties file.
   
 
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17575][DOCS] Remove extra table tags in configuration document

2016-09-17 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 9dbd4b864 -> bbe0b1d62


[SPARK-17575][DOCS] Remove extra table tags in configuration document

## What changes were proposed in this pull request?

Remove the extra table tags in the configuration document.

## How was this patch tested?

Ran all test cases and generated the document.

Before, with the extra tags, it looked like this:
![config-wrong1](https://cloud.githubusercontent.com/assets/8075390/18608239/c602bb60-7d01-11e6-875e-f38558997dd3.png)

![config-wrong2](https://cloud.githubusercontent.com/assets/8075390/18608241/cf3b672c-7d01-11e6-935e-1e73f9e6e578.png)

After removing the tags, it looks like this:

![config](https://cloud.githubusercontent.com/assets/8075390/18608245/e156eb8e-7d01-11e6-98aa-3be68d4d1961.png)

![config2](https://cloud.githubusercontent.com/assets/8075390/18608247/e84eecd4-7d01-11e6-9738-a3f7ff8fe834.png)

Author: sandy 

Closes #15130 from phalodi/SPARK-17575.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bbe0b1d6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bbe0b1d6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bbe0b1d6

Branch: refs/heads/master
Commit: bbe0b1d623741decce98827130cc67eb1fff1240
Parents: 9dbd4b8
Author: sandy 
Authored: Sat Sep 17 16:25:03 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 17 16:25:03 2016 +0100

--
 docs/configuration.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bbe0b1d6/docs/configuration.md
--
diff --git a/docs/configuration.md b/docs/configuration.md
index 8aea745..b505653 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -123,6 +123,7 @@ of the most common options to set are:
 Number of cores to use for the driver process, only in cluster mode.
   
 
+
   spark.driver.maxResultSize
   1g
   
@@ -217,7 +218,7 @@ Apart from these, the following properties are also 
available, and may be useful
 Note: In client mode, this config must not be set through 
the SparkConf
 directly in your application, because the driver JVM has already started 
at that point.
 Instead, please set this through the --driver-class-path 
command line option or in
-your default properties file.
+your default properties file.
   
 
 
@@ -244,7 +245,7 @@ Apart from these, the following properties are also 
available, and may be useful
 Note: In client mode, this config must not be set through 
the SparkConf
 directly in your application, because the driver JVM has already started 
at that point.
 Instead, please set this through the --driver-library-path 
command line option or in
-your default properties file.
+your default properties file.
   
 
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17480][SQL][FOLLOWUP] Fix more instances which calls List.length/size which is O(n)

2016-09-17 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master bbe0b1d62 -> 86c2d393a


[SPARK-17480][SQL][FOLLOWUP] Fix more instances which calls List.length/size 
which is O(n)

## What changes were proposed in this pull request?

This PR fixes all the remaining instances of the pattern that was fixed in the previous PR.

To make sure, I manually debugged and also checked the Scala source. `length` 
in 
[LinearSeqOptimized.scala#L49-L57](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/LinearSeqOptimized.scala#L49-L57)
 is O(n). Also, `size` calls `length` via 
[SeqLike.scala#L106](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/SeqLike.scala#L106).

For debugging, I have created these as below:

```scala
ArrayBuffer(1, 2, 3)
Array(1, 2, 3)
List(1, 2, 3)
Seq(1, 2, 3)
```

and then called `size` and `length` for each to debug.
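
To make the cost concrete, here is a small self-contained sketch (not from this patch) contrasting an index-based `while` loop over a `List`, where each `length` and `apply` call re-walks the list, with a single-pass `foldLeft` like the one the patch switches to:

```scala
// Standalone sketch (not the patched Spark code): why indexing/length on a
// List inside a loop is quadratic, and a linear single-pass alternative.
object ListTraversalSketch {
  def sumQuadratic(xs: List[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < xs.length) {   // xs.length walks the whole list on every iteration
      total += xs(i)          // xs(i) walks i nodes again: O(n^2) overall
      i += 1
    }
    total
  }

  def sumLinear(xs: List[Int]): Long =
    xs.foldLeft(0L)(_ + _)    // single pass over the list: O(n)

  def main(args: Array[String]): Unit = {
    val xs = List.tabulate(10000)(identity)
    println(sumQuadratic(xs))
    println(sumLinear(xs))
  }
}
```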

## How was this patch tested?

I ran the following bash commands on a Mac:

```bash
find . -name *.scala -type f -exec grep -il "while (.*\\.length)" {} \; | grep 
"src/main"
find . -name *.scala -type f -exec grep -il "while (.*\\.size)" {} \; | grep 
"src/main"
```

and then checked each.

Author: hyukjinkwon 

Closes #15093 from HyukjinKwon/SPARK-17480-followup.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/86c2d393
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/86c2d393
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/86c2d393

Branch: refs/heads/master
Commit: 86c2d393a56bf1e5114bc5a781253c0460efb8af
Parents: bbe0b1d
Author: hyukjinkwon 
Authored: Sat Sep 17 16:52:30 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 17 16:52:30 2016 +0100

--
 .../spark/sql/catalyst/analysis/Analyzer.scala  | 28 +++-
 .../expressions/conditionalExpressions.scala|  3 ++-
 .../sql/catalyst/expressions/ordering.scala |  3 ++-
 .../sql/catalyst/util/QuantileSummaries.scala   | 10 +++
 .../execution/datasources/jdbc/JdbcUtils.scala  |  2 +-
 .../apache/spark/sql/hive/HiveInspectors.scala  |  6 +++--
 .../org/apache/spark/sql/hive/TableReader.scala |  3 ++-
 .../org/apache/spark/sql/hive/hiveUDFs.scala|  3 ++-
 .../spark/sql/hive/orc/OrcFileFormat.scala  |  6 +++--
 9 files changed, 31 insertions(+), 33 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/86c2d393/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 5210f42..cc62d5e 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1663,27 +1663,17 @@ class Analyzer(
 }
   }.toSeq
 
-  // Third, for every Window Spec, we add a Window operator and set 
currentChild as the
-  // child of it.
-  var currentChild = child
-  var i = 0
-  while (i < groupedWindowExpressions.size) {
-val ((partitionSpec, orderSpec), windowExpressions) = 
groupedWindowExpressions(i)
-// Set currentChild to the newly created Window operator.
-currentChild =
-  Window(
-windowExpressions,
-partitionSpec,
-orderSpec,
-currentChild)
-
-// Move to next Window Spec.
-i += 1
-  }
+  // Third, we aggregate them by adding each Window operator for each 
Window Spec and then
+  // setting this to the child of the next Window operator.
+  val windowOps =
+groupedWindowExpressions.foldLeft(child) {
+  case (last, ((partitionSpec, orderSpec), windowExpressions)) =>
+Window(windowExpressions, partitionSpec, orderSpec, last)
+}
 
-  // Finally, we create a Project to output currentChild's output
+  // Finally, we create a Project to output windowOps's output
   // newExpressionsWithWindowFunctions.
-  Project(currentChild.output ++ newExpressionsWithWindowFunctions, 
currentChild)
+  Project(windowOps.output ++ newExpressionsWithWindowFunctions, windowOps)
 } // end of addWindow
 
 // We have to use transformDown at here to make sure the rule of

http://git-wip-us.apache.org/repos/asf/spark/blob/86c2d393/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
index

spark git commit: [SPARK-17480][SQL][FOLLOWUP] Fix more instances which calls List.length/size which is O(n)

2016-09-17 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 ec2b73656 -> a3bba372a


[SPARK-17480][SQL][FOLLOWUP] Fix more instances which calls List.length/size 
which is O(n)

This PR fixes all the remaining instances of the pattern that was fixed in the previous PR.

To make sure, I manually debugged and also checked the Scala source. `length` 
in 
[LinearSeqOptimized.scala#L49-L57](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/LinearSeqOptimized.scala#L49-L57)
 is O(n). Also, `size` calls `length` via 
[SeqLike.scala#L106](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/SeqLike.scala#L106).

For debugging, I have created these as below:

```scala
ArrayBuffer(1, 2, 3)
Array(1, 2, 3)
List(1, 2, 3)
Seq(1, 2, 3)
```

and then called `size` and `length` for each to debug.

I ran the following bash commands on a Mac:

```bash
find . -name *.scala -type f -exec grep -il "while (.*\\.length)" {} \; | grep 
"src/main"
find . -name *.scala -type f -exec grep -il "while (.*\\.size)" {} \; | grep 
"src/main"
```

and then checked each.

Author: hyukjinkwon 

Closes #15093 from HyukjinKwon/SPARK-17480-followup.

(cherry picked from commit 86c2d393a56bf1e5114bc5a781253c0460efb8af)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a3bba372
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a3bba372
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a3bba372

Branch: refs/heads/branch-2.0
Commit: a3bba372abce926351335d0a2936b70988f19b23
Parents: ec2b736
Author: hyukjinkwon 
Authored: Sat Sep 17 16:52:30 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 17 17:06:44 2016 +0100

--
 .../spark/sql/catalyst/analysis/Analyzer.scala  | 28 +++-
 .../expressions/conditionalExpressions.scala|  3 ++-
 .../sql/catalyst/expressions/ordering.scala |  3 ++-
 .../sql/catalyst/util/QuantileSummaries.scala   |  0
 .../apache/spark/sql/hive/HiveInspectors.scala  |  6 +++--
 .../org/apache/spark/sql/hive/TableReader.scala |  3 ++-
 .../org/apache/spark/sql/hive/hiveUDFs.scala|  3 ++-
 .../spark/sql/hive/orc/OrcFileFormat.scala  |  6 +++--
 8 files changed, 25 insertions(+), 27 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a3bba372/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 14e995e..3e4c769 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1649,27 +1649,17 @@ class Analyzer(
 }
   }.toSeq
 
-  // Third, for every Window Spec, we add a Window operator and set 
currentChild as the
-  // child of it.
-  var currentChild = child
-  var i = 0
-  while (i < groupedWindowExpressions.size) {
-val ((partitionSpec, orderSpec), windowExpressions) = 
groupedWindowExpressions(i)
-// Set currentChild to the newly created Window operator.
-currentChild =
-  Window(
-windowExpressions,
-partitionSpec,
-orderSpec,
-currentChild)
-
-// Move to next Window Spec.
-i += 1
-  }
+  // Third, we aggregate them by adding each Window operator for each 
Window Spec and then
+  // setting this to the child of the next Window operator.
+  val windowOps =
+groupedWindowExpressions.foldLeft(child) {
+  case (last, ((partitionSpec, orderSpec), windowExpressions)) =>
+Window(windowExpressions, partitionSpec, orderSpec, last)
+}
 
-  // Finally, we create a Project to output currentChild's output
+  // Finally, we create a Project to output windowOps's output
   // newExpressionsWithWindowFunctions.
-  Project(currentChild.output ++ newExpressionsWithWindowFunctions, 
currentChild)
+  Project(windowOps.output ++ newExpressionsWithWindowFunctions, windowOps)
 } // end of addWindow
 
 // We have to use transformDown at here to make sure the rule of

http://git-wip-us.apache.org/repos/asf/spark/blob/a3bba372/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
index 5f2585f..f9499cf 100644
--- 
a/sql/catalyst/

spark git commit: [SPARK-17480][SQL][FOLLOWUP] Fix more instances which calls List.length/size which is O(n)

2016-09-17 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 0cfc0469b -> 5fd354b2d


[SPARK-17480][SQL][FOLLOWUP] Fix more instances which calls List.length/size 
which is O(n)

This PR fixes all the remaining instances of the pattern that was fixed in the previous PR.

To make sure, I manually debugged and also checked the Scala source. `length` 
in 
[LinearSeqOptimized.scala#L49-L57](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/LinearSeqOptimized.scala#L49-L57)
 is O(n). Also, `size` calls `length` via 
[SeqLike.scala#L106](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/SeqLike.scala#L106).

For debugging, I have created these as below:

```scala
ArrayBuffer(1, 2, 3)
Array(1, 2, 3)
List(1, 2, 3)
Seq(1, 2, 3)
```

and then called `size` and `length` for each to debug.

I ran the following bash commands on a Mac:

```bash
find . -name *.scala -type f -exec grep -il "while (.*\\.length)" {} \; | grep 
"src/main"
find . -name *.scala -type f -exec grep -il "while (.*\\.size)" {} \; | grep 
"src/main"
```

and then checked each.

Author: hyukjinkwon 

Closes #15093 from HyukjinKwon/SPARK-17480-followup.

(cherry picked from commit 86c2d393a56bf1e5114bc5a781253c0460efb8af)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5fd354b2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5fd354b2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5fd354b2

Branch: refs/heads/branch-2.0
Commit: 5fd354b2d628130a74c9d01adc7ab6bef65fbd9a
Parents: 0cfc046
Author: hyukjinkwon 
Authored: Sat Sep 17 16:52:30 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 17 22:27:22 2016 +0100

--
 .../spark/sql/catalyst/analysis/Analyzer.scala  | 28 +++-
 .../expressions/conditionalExpressions.scala|  3 ++-
 .../sql/catalyst/expressions/ordering.scala |  3 ++-
 .../apache/spark/sql/hive/HiveInspectors.scala  |  6 +++--
 .../org/apache/spark/sql/hive/TableReader.scala |  3 ++-
 .../org/apache/spark/sql/hive/hiveUDFs.scala|  3 ++-
 .../spark/sql/hive/orc/OrcFileFormat.scala  |  6 +++--
 7 files changed, 25 insertions(+), 27 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5fd354b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 14e995e..3e4c769 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1649,27 +1649,17 @@ class Analyzer(
 }
   }.toSeq
 
-  // Third, for every Window Spec, we add a Window operator and set 
currentChild as the
-  // child of it.
-  var currentChild = child
-  var i = 0
-  while (i < groupedWindowExpressions.size) {
-val ((partitionSpec, orderSpec), windowExpressions) = 
groupedWindowExpressions(i)
-// Set currentChild to the newly created Window operator.
-currentChild =
-  Window(
-windowExpressions,
-partitionSpec,
-orderSpec,
-currentChild)
-
-// Move to next Window Spec.
-i += 1
-  }
+  // Third, we aggregate them by adding each Window operator for each 
Window Spec and then
+  // setting this to the child of the next Window operator.
+  val windowOps =
+groupedWindowExpressions.foldLeft(child) {
+  case (last, ((partitionSpec, orderSpec), windowExpressions)) =>
+Window(windowExpressions, partitionSpec, orderSpec, last)
+}
 
-  // Finally, we create a Project to output currentChild's output
+  // Finally, we create a Project to output windowOps's output
   // newExpressionsWithWindowFunctions.
-  Project(currentChild.output ++ newExpressionsWithWindowFunctions, 
currentChild)
+  Project(windowOps.output ++ newExpressionsWithWindowFunctions, windowOps)
 } // end of addWindow
 
 // We have to use transformDown at here to make sure the rule of

http://git-wip-us.apache.org/repos/asf/spark/blob/5fd354b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
index 5f2585f..f9499cf 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressio

spark-website git commit: replace with valid url to rdd paper

2016-09-17 Thread srowen
Repository: spark-website
Updated Branches:
  refs/heads/asf-site a78faf582 -> eee58685c


replace with valid url to rdd paper


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/eee58685
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/eee58685
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/eee58685

Branch: refs/heads/asf-site
Commit: eee58685c39269c191a921c39f1520c747a42318
Parents: a78faf5
Author: Xin Ren 
Authored: Fri Sep 16 16:31:23 2016 -0700
Committer: Xin Ren 
Committed: Fri Sep 16 16:31:23 2016 -0700

--
 research.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/eee58685/research.md
--
diff --git a/research.md b/research.md
index 41841a1..ec7dd54 100644
--- a/research.md
+++ b/research.md
@@ -27,7 +27,7 @@ Traditional MapReduce and DAG engines are suboptimal for 
these applications beca
 
 
 
-Spark offers an abstraction called http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf";>resilient
 distributed datasets (RDDs) to support these applications 
efficiently. RDDs can be stored in memory between queries without 
requiring replication.  Instead, they rebuild lost data on failure using 
lineage: each RDD remembers how it was built from other datasets (by 
transformations like map, join or 
groupBy) to rebuild itself.  RDDs allow Spark to outperform 
existing models by up to 100x in multi-pass analytics. We showed that RDDs can 
support a wide variety of iterative algorithms, as well as interactive data 
mining and a highly efficient SQL engine (http://shark.cs.berkeley.edu";>Shark).
+Spark offers an abstraction called http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf";>resilient
 distributed datasets (RDDs) to support these applications 
efficiently. RDDs can be stored in memory between queries without 
requiring replication.  Instead, they rebuild lost data on failure using 
lineage: each RDD remembers how it was built from other datasets (by 
transformations like map, join or 
groupBy) to rebuild itself.  RDDs allow Spark to outperform 
existing models by up to 100x in multi-pass analytics. We showed that RDDs can 
support a wide variety of iterative algorithms, as well as interactive data 
mining and a highly efficient SQL engine (http://shark.cs.berkeley.edu";>Shark).
 
 
 You can find more about the research behind Spark in the 
following papers:


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17506][SQL] Improve the check double values equality rule.

2016-09-18 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 3fe630d31 -> 5d3f4615f


[SPARK-17506][SQL] Improve the check double values equality rule.

## What changes were proposed in this pull request?

In `ExpressionEvalHelper`, we check the equality of two double values by comparing whether the expected value is within the range [target - tolerance, target + tolerance], but this can cause a false negative when the compared numbers are very large.
Before:
```
val1 = 1.6358558070241E306
val2 = 1.6358558070240974E306
ExpressionEvalHelper.compareResults(val1, val2)
false
```
In fact, `val1` and `val2` are the same value at different precisions; we should tolerate this case by comparing within a percentage range, e.g., the expected value is within the range [target - target * tolerance_percentage, target + target * tolerance_percentage].
After:
```
val1 = 1.6358558070241E306
val2 = 1.6358558070240974E306
ExpressionEvalHelper.compareResults(val1, val2)
true
```
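
For reference, a minimal sketch of the relative-error check described above; the helper name mirrors the `relativeErrorComparison` call in the diff below, but the tolerance value is an assumed illustrative default, not necessarily what the patch uses.

```scala
// Compare two doubles by relative error instead of an absolute spread.
def relativeErrorComparison(result: Double, expected: Double,
    tolerance: Double = 1e-8): Boolean = {
  if (expected == 0.0) {
    math.abs(result) <= tolerance                        // absolute check near zero
  } else {
    math.abs(result - expected) / math.abs(expected) <= tolerance
  }
}
```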

## How was this patch tested?

Existing test cases.

Author: jiangxingbo 

Closes #15059 from jiangxb1987/deq.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5d3f4615
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5d3f4615
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5d3f4615

Branch: refs/heads/master
Commit: 5d3f4615f8d0a19b97cde5ae603f74aef2cc2fd2
Parents: 3fe630d
Author: jiangxingbo 
Authored: Sun Sep 18 16:04:37 2016 +0100
Committer: Sean Owen 
Committed: Sun Sep 18 16:04:37 2016 +0100

--
 .../expressions/ArithmeticExpressionSuite.scala |  8 ++
 .../expressions/ExpressionEvalHelper.scala  | 29 ++--
 2 files changed, 30 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5d3f4615/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala
index 6873875..5c98242 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala
@@ -170,11 +170,9 @@ class ArithmeticExpressionSuite extends SparkFunSuite with 
ExpressionEvalHelper
 checkEvaluation(Remainder(positiveLongLit, positiveLongLit), 0L)
 checkEvaluation(Remainder(negativeLongLit, negativeLongLit), 0L)
 
-// TODO: the following lines would fail the test due to inconsistency 
result of interpret
-// and codegen for remainder between giant values, seems like a numeric 
stability issue
-// DataTypeTestUtils.numericTypeWithoutDecimal.foreach { tpe =>
-//  checkConsistencyBetweenInterpretedAndCodegen(Remainder, tpe, tpe)
-// }
+DataTypeTestUtils.numericTypeWithoutDecimal.foreach { tpe =>
+  checkConsistencyBetweenInterpretedAndCodegen(Remainder, tpe, tpe)
+}
   }
 
   test("Abs") {

http://git-wip-us.apache.org/repos/asf/spark/blob/5d3f4615/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala
index 668543a..f0c149c 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala
@@ -19,6 +19,7 @@ package org.apache.spark.sql.catalyst.expressions
 
 import org.scalacheck.Gen
 import org.scalactic.TripleEqualsSupport.Spread
+import org.scalatest.exceptions.TestFailedException
 import org.scalatest.prop.GeneratorDrivenPropertyChecks
 
 import org.apache.spark.SparkFunSuite
@@ -289,13 +290,37 @@ trait ExpressionEvalHelper extends 
GeneratorDrivenPropertyChecks {
 (result, expected) match {
   case (result: Array[Byte], expected: Array[Byte]) =>
 java.util.Arrays.equals(result, expected)
-  case (result: Double, expected: Spread[Double @unchecked]) =>
-expected.asInstanceOf[Spread[Double]].isWithin(result)
   case (result: Double, expected: Double) if result.isNaN && 
expected.isNaN =>
 true
+  case (result: Double, expected: Double) =>
+relativeErrorComparison(result, expected)
   case (result: Float, expected: Float) if result.isNaN && expected.isNaN 
=>
 true
   case _ => result 

spark git commit: [SPARK-17546][DEPLOY] start-* scripts should use hostname -f

2016-09-18 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 5d3f4615f -> 342c0e65b


[SPARK-17546][DEPLOY] start-* scripts should use hostname -f

## What changes were proposed in this pull request?

Call `hostname -f` to get fully qualified host name

## How was this patch tested?

Jenkins tests of course, but also verified output of command on OS X and Linux

Author: Sean Owen 

Closes #15129 from srowen/SPARK-17546.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/342c0e65
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/342c0e65
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/342c0e65

Branch: refs/heads/master
Commit: 342c0e65bec4b9a715017089ab6ea127f3c46540
Parents: 5d3f461
Author: Sean Owen 
Authored: Sun Sep 18 16:22:31 2016 +0100
Committer: Sean Owen 
Committed: Sun Sep 18 16:22:31 2016 +0100

--
 sbin/start-master.sh   | 2 +-
 sbin/start-mesos-dispatcher.sh | 2 +-
 sbin/start-slaves.sh   | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/342c0e65/sbin/start-master.sh
--
diff --git a/sbin/start-master.sh b/sbin/start-master.sh
index 981cb15..d970fcc 100755
--- a/sbin/start-master.sh
+++ b/sbin/start-master.sh
@@ -48,7 +48,7 @@ if [ "$SPARK_MASTER_PORT" = "" ]; then
 fi
 
 if [ "$SPARK_MASTER_HOST" = "" ]; then
-  SPARK_MASTER_HOST=`hostname`
+  SPARK_MASTER_HOST=`hostname -f`
 fi
 
 if [ "$SPARK_MASTER_WEBUI_PORT" = "" ]; then

http://git-wip-us.apache.org/repos/asf/spark/blob/342c0e65/sbin/start-mesos-dispatcher.sh
--
diff --git a/sbin/start-mesos-dispatcher.sh b/sbin/start-mesos-dispatcher.sh
index 06a966d..ef65fb9 100755
--- a/sbin/start-mesos-dispatcher.sh
+++ b/sbin/start-mesos-dispatcher.sh
@@ -34,7 +34,7 @@ if [ "$SPARK_MESOS_DISPATCHER_PORT" = "" ]; then
 fi
 
 if [ "$SPARK_MESOS_DISPATCHER_HOST" = "" ]; then
-  SPARK_MESOS_DISPATCHER_HOST=`hostname`
+  SPARK_MESOS_DISPATCHER_HOST=`hostname -f`
 fi
 
 if [ "$SPARK_MESOS_DISPATCHER_NUM" = "" ]; then

http://git-wip-us.apache.org/repos/asf/spark/blob/342c0e65/sbin/start-slaves.sh
--
diff --git a/sbin/start-slaves.sh b/sbin/start-slaves.sh
index 0fa1605..7d88712 100755
--- a/sbin/start-slaves.sh
+++ b/sbin/start-slaves.sh
@@ -32,7 +32,7 @@ if [ "$SPARK_MASTER_PORT" = "" ]; then
 fi
 
 if [ "$SPARK_MASTER_HOST" = "" ]; then
-  SPARK_MASTER_HOST="`hostname`"
+  SPARK_MASTER_HOST="`hostname -f`"
 fi
 
 # Launch the slaves


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17546][DEPLOY] start-* scripts should use hostname -f

2016-09-18 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 cf728b0f2 -> 5619f095b


[SPARK-17546][DEPLOY] start-* scripts should use hostname -f

## What changes were proposed in this pull request?

Call `hostname -f` to get fully qualified host name

## How was this patch tested?

Jenkins tests of course, but also verified output of command on OS X and Linux

Author: Sean Owen 

Closes #15129 from srowen/SPARK-17546.

(cherry picked from commit 342c0e65bec4b9a715017089ab6ea127f3c46540)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5619f095
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5619f095
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5619f095

Branch: refs/heads/branch-2.0
Commit: 5619f095bfac76009758b4f4a4f8c9e319eeb5b1
Parents: cf728b0
Author: Sean Owen 
Authored: Sun Sep 18 16:22:31 2016 +0100
Committer: Sean Owen 
Committed: Sun Sep 18 16:22:40 2016 +0100

--
 sbin/start-master.sh   | 2 +-
 sbin/start-mesos-dispatcher.sh | 2 +-
 sbin/start-slaves.sh   | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5619f095/sbin/start-master.sh
--
diff --git a/sbin/start-master.sh b/sbin/start-master.sh
index 981cb15..d970fcc 100755
--- a/sbin/start-master.sh
+++ b/sbin/start-master.sh
@@ -48,7 +48,7 @@ if [ "$SPARK_MASTER_PORT" = "" ]; then
 fi
 
 if [ "$SPARK_MASTER_HOST" = "" ]; then
-  SPARK_MASTER_HOST=`hostname`
+  SPARK_MASTER_HOST=`hostname -f`
 fi
 
 if [ "$SPARK_MASTER_WEBUI_PORT" = "" ]; then

http://git-wip-us.apache.org/repos/asf/spark/blob/5619f095/sbin/start-mesos-dispatcher.sh
--
diff --git a/sbin/start-mesos-dispatcher.sh b/sbin/start-mesos-dispatcher.sh
index 06a966d..ef65fb9 100755
--- a/sbin/start-mesos-dispatcher.sh
+++ b/sbin/start-mesos-dispatcher.sh
@@ -34,7 +34,7 @@ if [ "$SPARK_MESOS_DISPATCHER_PORT" = "" ]; then
 fi
 
 if [ "$SPARK_MESOS_DISPATCHER_HOST" = "" ]; then
-  SPARK_MESOS_DISPATCHER_HOST=`hostname`
+  SPARK_MESOS_DISPATCHER_HOST=`hostname -f`
 fi
 
 if [ "$SPARK_MESOS_DISPATCHER_NUM" = "" ]; then

http://git-wip-us.apache.org/repos/asf/spark/blob/5619f095/sbin/start-slaves.sh
--
diff --git a/sbin/start-slaves.sh b/sbin/start-slaves.sh
index 0fa1605..7d88712 100755
--- a/sbin/start-slaves.sh
+++ b/sbin/start-slaves.sh
@@ -32,7 +32,7 @@ if [ "$SPARK_MASTER_PORT" = "" ]; then
 fi
 
 if [ "$SPARK_MASTER_HOST" = "" ]; then
-  SPARK_MASTER_HOST="`hostname`"
+  SPARK_MASTER_HOST="`hostname -f`"
 fi
 
 # Launch the slaves


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17586][BUILD] Do not call static member via instance reference

2016-09-18 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 342c0e65b -> 7151011b3


[SPARK-17586][BUILD] Do not call static member via instance reference

## What changes were proposed in this pull request?

This PR fixes a warning message as below:

```
[WARNING] .../UnsafeInMemorySorter.java:284: warning: [static] static method 
should be qualified by type name, TaskMemoryManager, instead of by an expression
[WARNING]   currentPageNumber = 
memoryManager.decodePageNumber(recordPointer)
```

by referencing the static member via class not instance reference.

## How was this patch tested?

Existing tests should cover this - Jenkins tests.

Author: hyukjinkwon 

Closes #15141 from HyukjinKwon/SPARK-17586.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7151011b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7151011b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7151011b

Branch: refs/heads/master
Commit: 7151011b38a841d9d4bc2e453b9a7cfe42f74f8f
Parents: 342c0e6
Author: hyukjinkwon 
Authored: Sun Sep 18 19:18:49 2016 +0100
Committer: Sean Owen 
Committed: Sun Sep 18 19:18:49 2016 +0100

--
 .../spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7151011b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
--
diff --git 
a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
 
b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
index be38295..3b1ece4 100644
--- 
a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
+++ 
b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
@@ -281,7 +281,7 @@ public final class UnsafeInMemorySorter {
 public void loadNext() {
   // This pointer points to a 4-byte record length, followed by the 
record's bytes
   final long recordPointer = array.get(offset + position);
-  currentPageNumber = memoryManager.decodePageNumber(recordPointer);
+  currentPageNumber = TaskMemoryManager.decodePageNumber(recordPointer);
   baseObject = memoryManager.getPage(recordPointer);
   baseOffset = memoryManager.getOffsetInPage(recordPointer) + 4;  // Skip 
over record length
   recordLength = Platform.getInt(baseObject, baseOffset - 4);


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17586][BUILD] Do not call static member via instance reference

2016-09-18 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 5619f095b -> 6c67d86f2


[SPARK-17586][BUILD] Do not call static member via instance reference

## What changes were proposed in this pull request?

This PR fixes a warning message as below:

```
[WARNING] .../UnsafeInMemorySorter.java:284: warning: [static] static method 
should be qualified by type name, TaskMemoryManager, instead of by an expression
[WARNING]   currentPageNumber = 
memoryManager.decodePageNumber(recordPointer)
```

by referencing the static member via class not instance reference.

## How was this patch tested?

Existing tests should cover this - Jenkins tests.

Author: hyukjinkwon 

Closes #15141 from HyukjinKwon/SPARK-17586.

(cherry picked from commit 7151011b38a841d9d4bc2e453b9a7cfe42f74f8f)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6c67d86f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6c67d86f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6c67d86f

Branch: refs/heads/branch-2.0
Commit: 6c67d86f2f0a24764146827ec5c42969194cb11d
Parents: 5619f09
Author: hyukjinkwon 
Authored: Sun Sep 18 19:18:49 2016 +0100
Committer: Sean Owen 
Committed: Sun Sep 18 19:18:59 2016 +0100

--
 .../spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6c67d86f/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
--
diff --git 
a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
 
b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
index 9710529..b517371 100644
--- 
a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
+++ 
b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
@@ -258,7 +258,7 @@ public final class UnsafeInMemorySorter {
 public void loadNext() {
   // This pointer points to a 4-byte record length, followed by the 
record's bytes
   final long recordPointer = array.get(offset + position);
-  currentPageNumber = memoryManager.decodePageNumber(recordPointer);
+  currentPageNumber = TaskMemoryManager.decodePageNumber(recordPointer);
   baseObject = memoryManager.getPage(recordPointer);
   baseOffset = memoryManager.getOffsetInPage(recordPointer) + 4;  // Skip 
over record length
   recordLength = Platform.getInt(baseObject, baseOffset - 4);


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly

2016-09-18 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 7151011b3 -> 1dbb725db


[SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly

## Problem

CSV in Spark 2.0.0:
- does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, and `DateType` -- this is a regression compared to 1.6;
- does not read empty values (specified by `options.nullValue`) as `null`s for 
`StringType` -- this is compatible with 1.6 but leads to problems like 
SPARK-16903.

## What changes were proposed in this pull request?

This patch makes changes to read all empty values back as `null`s.
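
A hedged usage sketch of the behaviour described above (the SparkSession `spark`, the schema, and the file path are assumptions for illustration):

```scala
import org.apache.spark.sql.types._

// With this change, fields equal to nullValue (here the empty string) are read
// back as null for every supported type, including StringType.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("active", BooleanType),
  StructField("name", StringType)))

val df = spark.read
  .option("header", "true")
  .option("nullValue", "")      // empty fields become null
  .schema(schema)
  .csv("/tmp/people.csv")       // hypothetical input path
```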

## How was this patch tested?

New test cases.

Author: Liwei Lin 

Closes #14118 from lw-lin/csv-cast-null.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1dbb725d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1dbb725d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1dbb725d

Branch: refs/heads/master
Commit: 1dbb725dbef30bf7633584ce8efdb573f2d92bca
Parents: 7151011
Author: Liwei Lin 
Authored: Sun Sep 18 19:25:58 2016 +0100
Committer: Sean Owen 
Committed: Sun Sep 18 19:25:58 2016 +0100

--
 python/pyspark/sql/readwriter.py|   3 +-
 python/pyspark/sql/streaming.py |   3 +-
 .../org/apache/spark/sql/DataFrameReader.scala  |   3 +-
 .../datasources/csv/CSVInferSchema.scala| 108 +--
 .../spark/sql/streaming/DataStreamReader.scala  |   3 +-
 .../execution/datasources/csv/CSVSuite.scala|   2 +-
 .../datasources/csv/CSVTypeCastSuite.scala  |  54 ++
 7 files changed, 93 insertions(+), 83 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1dbb725d/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 3d79e0c..a6860ef 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -329,7 +329,8 @@ class DataFrameReader(OptionUtils):
  being read should be skipped. If None 
is set, it uses
  the default value, ``false``.
 :param nullValue: sets the string representation of a null value. If 
None is set, it uses
-  the default value, empty string.
+  the default value, empty string. Since 2.0.1, this 
``nullValue`` param
+  applies to all supported types including the string 
type.
 :param nanValue: sets the string representation of a non-number value. 
If None is set, it
  uses the default value, ``NaN``.
 :param positiveInf: sets the string representation of a positive 
infinity value. If None

http://git-wip-us.apache.org/repos/asf/spark/blob/1dbb725d/python/pyspark/sql/streaming.py
--
diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py
index 67375f6..0136451 100644
--- a/python/pyspark/sql/streaming.py
+++ b/python/pyspark/sql/streaming.py
@@ -497,7 +497,8 @@ class DataStreamReader(OptionUtils):
  being read should be skipped. If None 
is set, it uses
  the default value, ``false``.
 :param nullValue: sets the string representation of a null value. If 
None is set, it uses
-  the default value, empty string.
+  the default value, empty string. Since 2.0.1, this 
``nullValue`` param
+  applies to all supported types including the string 
type.
 :param nanValue: sets the string representation of a non-number value. 
If None is set, it
  uses the default value, ``NaN``.
 :param positiveInf: sets the string representation of a positive 
infinity value. If None

http://git-wip-us.apache.org/repos/asf/spark/blob/1dbb725d/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index d29d90c..30f39c7 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -376,7 +376,8 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* from values being read should be skipped.
* `ignoreTrailingWhiteSpace` (default `false`): defines whether or not 
trailing
* whitespaces from values b

spark git commit: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly

2016-09-18 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 6c67d86f2 -> 151f808a1


[SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly

## Problem

CSV in Spark 2.0.0:
- does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, and `DateType` -- this is a regression compared to 1.6;
- does not read empty values (specified by `options.nullValue`) as `null`s for 
`StringType` -- this is compatible with 1.6 but leads to problems like 
SPARK-16903.

## What changes were proposed in this pull request?

This patch makes changes to read all empty values back as `null`s.

## How was this patch tested?

New test cases.

Author: Liwei Lin 

Closes #14118 from lw-lin/csv-cast-null.

(cherry picked from commit 1dbb725dbef30bf7633584ce8efdb573f2d92bca)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/151f808a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/151f808a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/151f808a

Branch: refs/heads/branch-2.0
Commit: 151f808a181333daa6300c7d5d7c49c3cec3307c
Parents: 6c67d86
Author: Liwei Lin 
Authored: Sun Sep 18 19:25:58 2016 +0100
Committer: Sean Owen 
Committed: Sun Sep 18 19:26:08 2016 +0100

--
 python/pyspark/sql/readwriter.py|   3 +-
 python/pyspark/sql/streaming.py |   3 +-
 .../org/apache/spark/sql/DataFrameReader.scala  |   3 +-
 .../datasources/csv/CSVInferSchema.scala| 108 +--
 .../spark/sql/streaming/DataStreamReader.scala  |   3 +-
 .../execution/datasources/csv/CSVSuite.scala|   2 +-
 .../datasources/csv/CSVTypeCastSuite.scala  |  54 ++
 7 files changed, 93 insertions(+), 83 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/151f808a/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 3da6f49..dc13a81 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -327,7 +327,8 @@ class DataFrameReader(OptionUtils):
  being read should be skipped. If None 
is set, it uses
  the default value, ``false``.
 :param nullValue: sets the string representation of a null value. If 
None is set, it uses
-  the default value, empty string.
+  the default value, empty string. Since 2.0.1, this 
``nullValue`` param
+  applies to all supported types including the string 
type.
 :param nanValue: sets the string representation of a non-number value. 
If None is set, it
  uses the default value, ``NaN``.
 :param positiveInf: sets the string representation of a positive 
infinity value. If None

http://git-wip-us.apache.org/repos/asf/spark/blob/151f808a/python/pyspark/sql/streaming.py
--
diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py
index 9487f9d..38c19e2 100644
--- a/python/pyspark/sql/streaming.py
+++ b/python/pyspark/sql/streaming.py
@@ -495,7 +495,8 @@ class DataStreamReader(OptionUtils):
  being read should be skipped. If None 
is set, it uses
  the default value, ``false``.
 :param nullValue: sets the string representation of a null value. If 
None is set, it uses
-  the default value, empty string.
+  the default value, empty string. Since 2.0.1, this 
``nullValue`` param
+  applies to all supported types including the string 
type.
 :param nanValue: sets the string representation of a non-number value. 
If None is set, it
  uses the default value, ``NaN``.
 :param positiveInf: sets the string representation of a positive 
infinity value. If None

http://git-wip-us.apache.org/repos/asf/spark/blob/151f808a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index 410cb20..fe3da25 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -377,7 +377,8 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* from values being read should be skipped.
* `ignor

spark git commit: [SPARK-17297][DOCS] Clarify window/slide duration as absolute time, not relative to a calendar

2016-09-19 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 8f0c35a4d -> d720a4019


[SPARK-17297][DOCS] Clarify window/slide duration as absolute time, not 
relative to a calendar

## What changes were proposed in this pull request?

Clarify that slide and window duration are absolute, and not relative to a 
calendar.
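
A small usage sketch (the `events` DataFrame and its columns are assumed) showing the durations this doc change talks about; `1 day` below is always a fixed 86,400,000 ms, regardless of calendar:

```scala
import org.apache.spark.sql.functions.{count, window}

// events is an assumed DataFrame with an "eventTime" column of TimestampType.
val dailyCounts = events
  .groupBy(window(events("eventTime"), "1 day", "1 hour"))   // fixed-length window and slide
  .agg(count("*"))
```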

## How was this patch tested?

Doc build (no functional change)

Author: Sean Owen 

Closes #15142 from srowen/SPARK-17297.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d720a401
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d720a401
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d720a401

Branch: refs/heads/master
Commit: d720a4019460b6c284d0473249303c349df60a1f
Parents: 8f0c35a
Author: Sean Owen 
Authored: Mon Sep 19 09:38:25 2016 +0100
Committer: Sean Owen 
Committed: Mon Sep 19 09:38:25 2016 +0100

--
 R/pkg/R/functions.R  |  8 ++--
 .../main/scala/org/apache/spark/sql/functions.scala  | 15 +++
 2 files changed, 17 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d720a401/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index ceedbe7..4d94b4c 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -2713,11 +2713,15 @@ setMethod("from_unixtime", signature(x = "Column"),
 #' @param x a time Column. Must be of TimestampType.
 #' @param windowDuration a string specifying the width of the window, e.g. '1 
second',
 #'   '1 day 12 hours', '2 minutes'. Valid interval strings 
are 'week',
-#'   'day', 'hour', 'minute', 'second', 'millisecond', 
'microsecond'.
+#'   'day', 'hour', 'minute', 'second', 'millisecond', 
'microsecond'. Note that
+#'   the duration is a fixed length of time, and does not 
vary over time
+#'   according to a calendar. For example, '1 day' always 
means 86,400,000
+#'   milliseconds, not a calendar day.
 #' @param slideDuration a string specifying the sliding interval of the 
window. Same format as
 #'  \code{windowDuration}. A new window will be generated 
every
 #'  \code{slideDuration}. Must be less than or equal to
-#'  the \code{windowDuration}.
+#'  the \code{windowDuration}. This duration is likewise 
absolute, and does not
+#'  vary according to a calendar.
 #' @param startTime the offset with respect to 1970-01-01 00:00:00 UTC with 
which to start
 #'  window intervals. For example, in order to have hourly 
tumbling windows
 #'  that start 15 minutes past the hour, e.g. 12:15-13:15, 
13:15-14:15... provide

http://git-wip-us.apache.org/repos/asf/spark/blob/d720a401/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
index 18e736a..960c87f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
@@ -2606,12 +2606,15 @@ object functions {
*   The time column must be of TimestampType.
* @param windowDuration A string specifying the width of the window, e.g. 
`10 minutes`,
*   `1 second`. Check 
[[org.apache.spark.unsafe.types.CalendarInterval]] for
-   *   valid duration identifiers.
+   *   valid duration identifiers. Note that the duration 
is a fixed length of
+   *   time, and does not vary over time according to a 
calendar. For example,
+   *   `1 day` always means 86,400,000 milliseconds, not a 
calendar day.
* @param slideDuration A string specifying the sliding interval of the 
window, e.g. `1 minute`.
*  A new window will be generated every 
`slideDuration`. Must be less than
*  or equal to the `windowDuration`. Check
*  [[org.apache.spark.unsafe.types.CalendarInterval]] 
for valid duration
-   *  identifiers.
+   *  identifiers. This duration is likewise absolute, and 
does not vary
+*   

spark git commit: [SPARK-17297][DOCS] Clarify window/slide duration as absolute time, not relative to a calendar

2016-09-19 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 27ce39cf2 -> ac060397c


[SPARK-17297][DOCS] Clarify window/slide duration as absolute time, not 
relative to a calendar

## What changes were proposed in this pull request?

Clarify that slide and window duration are absolute, and not relative to a 
calendar.

## How was this patch tested?

Doc build (no functional change)

Author: Sean Owen 

Closes #15142 from srowen/SPARK-17297.

(cherry picked from commit d720a4019460b6c284d0473249303c349df60a1f)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ac060397
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ac060397
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ac060397

Branch: refs/heads/branch-2.0
Commit: ac060397c109158e84a2b57355c8dee5ab24ab7b
Parents: 27ce39c
Author: Sean Owen 
Authored: Mon Sep 19 09:38:25 2016 +0100
Committer: Sean Owen 
Committed: Mon Sep 19 09:38:36 2016 +0100

--
 R/pkg/R/functions.R  |  8 ++--
 .../main/scala/org/apache/spark/sql/functions.scala  | 15 +++
 2 files changed, 17 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ac060397/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index ceedbe7..4d94b4c 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -2713,11 +2713,15 @@ setMethod("from_unixtime", signature(x = "Column"),
 #' @param x a time Column. Must be of TimestampType.
 #' @param windowDuration a string specifying the width of the window, e.g. '1 
second',
 #'   '1 day 12 hours', '2 minutes'. Valid interval strings 
are 'week',
-#'   'day', 'hour', 'minute', 'second', 'millisecond', 
'microsecond'.
+#'   'day', 'hour', 'minute', 'second', 'millisecond', 
'microsecond'. Note that
+#'   the duration is a fixed length of time, and does not 
vary over time
+#'   according to a calendar. For example, '1 day' always 
means 86,400,000
+#'   milliseconds, not a calendar day.
 #' @param slideDuration a string specifying the sliding interval of the 
window. Same format as
 #'  \code{windowDuration}. A new window will be generated 
every
 #'  \code{slideDuration}. Must be less than or equal to
-#'  the \code{windowDuration}.
+#'  the \code{windowDuration}. This duration is likewise 
absolute, and does not
+#'  vary according to a calendar.
 #' @param startTime the offset with respect to 1970-01-01 00:00:00 UTC with 
which to start
 #'  window intervals. For example, in order to have hourly 
tumbling windows
 #'  that start 15 minutes past the hour, e.g. 12:15-13:15, 
13:15-14:15... provide

http://git-wip-us.apache.org/repos/asf/spark/blob/ac060397/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
index 4e185b8..eb504c8 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
@@ -2596,12 +2596,15 @@ object functions {
*   The time column must be of TimestampType.
* @param windowDuration A string specifying the width of the window, e.g. 
`10 minutes`,
*   `1 second`. Check 
[[org.apache.spark.unsafe.types.CalendarInterval]] for
-   *   valid duration identifiers.
+   *   valid duration identifiers. Note that the duration 
is a fixed length of
+   *   time, and does not vary over time according to a 
calendar. For example,
+   *   `1 day` always means 86,400,000 milliseconds, not a 
calendar day.
* @param slideDuration A string specifying the sliding interval of the 
window, e.g. `1 minute`.
*  A new window will be generated every 
`slideDuration`. Must be less than
*  or equal to the `windowDuration`. Check
*  [[org.apache.spark.unsafe.types.CalendarInterval]] 
for valid duration
-   *  identi

spark git commit: [SPARK-17437] Add uiWebUrl to JavaSparkContext and pyspark.SparkContext

2016-09-20 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master f039d964d -> 4a426ff8a


[SPARK-17437] Add uiWebUrl to JavaSparkContext and pyspark.SparkContext

## What changes were proposed in this pull request?

The Scala version of `SparkContext` has a handy field called `uiWebUrl` that 
tells you which URL the SparkUI spawned by that instance lives at. This is 
often very useful because the value for `spark.ui.port` in the config is only a 
suggestion; if that port number is taken by another Spark instance on the same 
machine, Spark will just keep incrementing the port until it finds a free one. 
So, on a machine with a lot of running PySpark instances, you often have to 
start trying all of them one-by-one until you find your application name.

Scala users have a way around this with `uiWebUrl` but Java and Python users do 
not. This pull request fixes this in the most straightforward way possible, 
simply propagating this field through the `JavaSparkContext` and into pyspark 
through the Java gateway.

Please let me know if any additional documentation/testing is needed.
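
For context, a minimal sketch of the Scala-side field this change exposes (master URL and app name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("ui-url-demo"))
// uiWebUrl is an Option[String]; it is empty when the UI is disabled.
sc.uiWebUrl.foreach(url => println(s"Spark UI running at $url"))
sc.stop()
```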

## How was this patch tested?

Existing tests were run to make sure there were no regressions, and a binary 
distribution was created and tested manually for the correct value of 
`sc.uiWebPort` in a variety of circumstances.

Author: Adrian Petrescu 

Closes #15000 from apetresc/pyspark-uiweburl.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4a426ff8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4a426ff8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4a426ff8

Branch: refs/heads/master
Commit: 4a426ff8aea4faa31a3016a453dec5b7954578dd
Parents: f039d96
Author: Adrian Petrescu 
Authored: Tue Sep 20 10:49:02 2016 +0100
Committer: Sean Owen 
Committed: Tue Sep 20 10:49:02 2016 +0100

--
 python/pyspark/context.py | 5 +
 1 file changed, 5 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4a426ff8/python/pyspark/context.py
--
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 2744bb9..5c32f8e 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -333,6 +333,11 @@ class SparkContext(object):
 return self._jsc.sc().applicationId()
 
 @property
+def uiWebUrl(self):
+"""Return the URL of the SparkUI instance started by this 
SparkContext"""
+return self._jsc.sc().uiWebUrl().get()
+
+@property
 def startTime(self):
 """Return the epoch time when the Spark Context was started."""
 return self._jsc.startTime()


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark-website git commit: Add Israel Spark meetup to community page per request. Use https for meetup while we're here. Pick up a recent change to paper hyperlink reflected only in markdown, not HTML

2016-09-21 Thread srowen
Repository: spark-website
Updated Branches:
  refs/heads/asf-site eee58685c -> 7c96b646e


Add Israel Spark meetup to community page per request. Use https for meetup 
while we're here. Pick up a recent change to paper hyperlink reflected only in 
markdown, not HTML


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/7c96b646
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/7c96b646
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/7c96b646

Branch: refs/heads/asf-site
Commit: 7c96b646eb2de2dbe6aec91a82d86699e13c59c5
Parents: eee5868
Author: Sean Owen 
Authored: Wed Sep 21 08:32:16 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 21 08:32:16 2016 +0100

--
 community.md| 57 +---
 site/community.html | 57 +---
 site/research.html  |  2 +-
 3 files changed, 61 insertions(+), 55 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/7c96b646/community.md
--
diff --git a/community.md b/community.md
index d856409..b0c5b3a 100644
--- a/community.md
+++ b/community.md
@@ -56,84 +56,87 @@ navigation:
 Spark Meetups are grass-roots events organized and hosted by leaders and 
champions in the community around the world. Check out http://spark.meetup.com";>http://spark.meetup.com to find a Spark 
meetup in your part of the world. Below is a partial list of Spark meetups.
 
   
-http://www.meetup.com/spark-users/";>Bay Area Spark Meetup.
+https://www.meetup.com/spark-users/";>Bay Area Spark Meetup.
 This group has been running since January 2012 in the San Francisco area.
-The meetup page also contains an http://www.meetup.com/spark-users/events/past/";>archive of past 
meetups, including videos and http://www.meetup.com/spark-users/files/";>slides for most of the 
recent talks.
+The meetup page also contains an https://www.meetup.com/spark-users/events/past/";>archive of past 
meetups, including videos and https://www.meetup.com/spark-users/files/";>slides for most of the 
recent talks.
   
   
-http://www.meetup.com/Spark-Barcelona/";>Barcelona Spark Meetup
+https://www.meetup.com/Spark-Barcelona/";>Barcelona Spark 
Meetup
   
   
-http://www.meetup.com/Spark_big_data_analytics/";>Bangalore Spark 
Meetup
+https://www.meetup.com/Spark_big_data_analytics/";>Bangalore Spark 
Meetup
   
   
-http://www.meetup.com/Berlin-Apache-Spark-Meetup/";>Berlin Spark 
Meetup
+https://www.meetup.com/Berlin-Apache-Spark-Meetup/";>Berlin Spark 
Meetup
   
   
-http://www.meetup.com/spark-user-beijing-Meetup/";>Beijing Spark 
Meetup
+https://www.meetup.com/spark-user-beijing-Meetup/";>Beijing Spark 
Meetup
   
   
-http://www.meetup.com/Boston-Apache-Spark-User-Group/";>Boston 
Spark Meetup
+https://www.meetup.com/Boston-Apache-Spark-User-Group/";>Boston 
Spark Meetup
   
   
-http://www.meetup.com/Boulder-Denver-Spark-Meetup/";>Boulder/Denver Spark 
Meetup
+https://www.meetup.com/Boulder-Denver-Spark-Meetup/";>Boulder/Denver Spark 
Meetup
   
   
-http://www.meetup.com/Chicago-Spark-Users/";>Chicago Spark 
Users
+https://www.meetup.com/Chicago-Spark-Users/";>Chicago Spark 
Users
   
   
-http://www.meetup.com/Christchurch-Apache-Spark-Meetup/";>Christchurch 
Apache Spark Meetup
+https://www.meetup.com/Christchurch-Apache-Spark-Meetup/";>Christchurch 
Apache Spark Meetup
   
   
-http://www.meetup.com/Cincinnati-Apache-Spark-Meetup/";>Cincinanati Apache 
Spark Meetup
+https://www.meetup.com/Cincinnati-Apache-Spark-Meetup/";>Cincinanati 
Apache Spark Meetup
   
   
-http://www.meetup.com/Hangzhou-Apache-Spark-Meetup/";>Hangzhou 
Spark Meetup
+https://www.meetup.com/Hangzhou-Apache-Spark-Meetup/";>Hangzhou 
Spark Meetup
   
   
-http://www.meetup.com/Spark-User-Group-Hyderabad/";>Hyderabad 
Spark Meetup
+https://www.meetup.com/Spark-User-Group-Hyderabad/";>Hyderabad 
Spark Meetup
   
   
-http://www.meetup.com/Apache-Spark-Ljubljana-Meetup/";>Ljubljana 
Spark Meetup
+https://www.meetup.com/israel-spark-users/";>Israel Spark Users
   
   
-http://www.meetup.com/Spark-London/";>London Spark Meetup
+https://www.meetup.com/Apache-Spark-Ljubljana-Meetup/";>Ljubljana 
Spark Meetup
   
   
-http://www.meetup.com/Apache-Spark-Maryland/";>Maryland Spark 
Meetup
+https://www.meetup.com/Spark-London/";>London Spark Meetup
   
   
-http://www.meetup.com/Mumbai-Spark-Meetup/";>Mumbai Spark 
Meetup
+https://www.meetup.com/Apache-Spark-Maryland/";>Maryland Spark 
Meetup
   
   
-http://www.meetup.com/Apache-Spark-in-Moscow/";>Moscow Spark 
Meetup
+https://www.meetup.com/Mumbai-

spark git commit: [CORE][DOC] Fix errors in comments

2016-09-21 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master e48ebc4e4 -> 61876a427


[CORE][DOC] Fix errors in comments

## What changes were proposed in this pull request?
While reading the source code of the core and SQL core modules, I found some minor errors in comments, such as extra spaces, missing blank lines, and grammar errors.

I fixed these minor errors and may find more as I continue studying the source code.

## How was this patch tested?
Manually build

Author: wm...@hotmail.com 

Closes #15151 from wangmiao1981/mem.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/61876a42
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/61876a42
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/61876a42

Branch: refs/heads/master
Commit: 61876a42793bde0da90f54b44255148ed54b7f61
Parents: e48ebc4
Author: wm...@hotmail.com 
Authored: Wed Sep 21 09:33:29 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 21 09:33:29 2016 +0100

--
 core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala | 2 +-
 sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/61876a42/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala
--
diff --git a/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala 
b/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala
index cae7c9e..f255f5b 100644
--- a/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala
+++ b/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala
@@ -28,7 +28,7 @@ import org.apache.spark.util.Utils
  * :: DeveloperApi ::
  * This class represent an unique identifier for a BlockManager.
  *
- * The first 2 constructors of this class is made private to ensure that 
BlockManagerId objects
+ * The first 2 constructors of this class are made private to ensure that 
BlockManagerId objects
  * can be created only using the apply method in the companion object. This 
allows de-duplication
  * of ID objects. Also, constructor parameters are private to ensure that 
parameters cannot be
  * modified from outside this class.

http://git-wip-us.apache.org/repos/asf/spark/blob/61876a42/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
index 0f6292d..6d7ac0f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
@@ -937,7 +937,7 @@ object SparkSession {
   }
 
   /**
-   * Return true if Hive classes can be loaded, otherwise false.
+   * @return true if Hive classes can be loaded, otherwise false.
*/
   private[spark] def hiveClassesArePresent: Boolean = {
 try {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17595][MLLIB] Use a bounded priority queue to find synonyms in Word2VecModel

2016-09-21 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master d3b886976 -> 7654385f2


[SPARK-17595][MLLIB] Use a bounded priority queue to find synonyms in 
Word2VecModel

## What changes were proposed in this pull request?

The code in `Word2VecModel.findSynonyms` that chooses the vocabulary elements with the highest similarity to the query vector currently sorts the full collection of similarities (one entry per vocabulary element). This involves making multiple copies of that collection while doing a (relatively) expensive sort. It
would be more efficient to find the best matches by maintaining a bounded 
priority queue and populating it with a single pass over the vocabulary, and 
that is exactly what this patch does.
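
As an illustration of the single-pass bounded top-k pattern described above, here is a generic sketch using the standard-library priority queue (not Spark's `BoundedPriorityQueue`):

```scala
import scala.collection.mutable

// Keep only the k highest-scoring items while traversing the input once.
def topK[T](items: Iterator[(T, Double)], k: Int): Seq[(T, Double)] = {
  // Order by negated score so that dequeue() removes the current smallest score.
  val minOrdering = Ordering.by[(T, Double), Double](item => -item._2)
  val pq = mutable.PriorityQueue.empty[(T, Double)](minOrdering)
  items.foreach { item =>
    pq.enqueue(item)
    if (pq.size > k) pq.dequeue()      // evict the smallest of the k + 1 candidates
  }
  pq.toSeq.sortBy(item => -item._2)    // sort only the k survivors, descending by score
}
```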

## How was this patch tested?

This patch adds no user-visible functionality and its correctness should be 
exercised by existing tests.  To ensure that this approach is actually faster, 
I made a microbenchmark for `findSynonyms`:

```
object W2VTiming {
  import org.apache.spark.{SparkContext, SparkConf}
  import org.apache.spark.mllib.feature.Word2VecModel
  def run(modelPath: String, scOpt: Option[SparkContext] = None) {
val sc = scOpt.getOrElse(new SparkContext(new 
SparkConf(true).setMaster("local[*]").setAppName("test")))
val model = Word2VecModel.load(sc, modelPath)
val keys = model.getVectors.keys
val start = System.currentTimeMillis
for(key <- keys) {
  model.findSynonyms(key, 5)
  model.findSynonyms(key, 10)
  model.findSynonyms(key, 25)
  model.findSynonyms(key, 50)
}
val finish = System.currentTimeMillis
println("run completed in " + (finish - start) + "ms")
  }
}
```

I ran this test on a model generated from the complete works of Jane Austen and 
found that the new approach was over 3x faster than the old approach.  (If the 
`num` argument to `findSynonyms` is very close to the vocabulary size, the new 
approach will have less of an advantage over the old one.)

Author: William Benton 

Closes #15150 from willb/SPARK-17595.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7654385f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7654385f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7654385f

Branch: refs/heads/master
Commit: 7654385f268a3f481c4574ce47a19ab21155efd5
Parents: d3b8869
Author: William Benton 
Authored: Wed Sep 21 09:45:06 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 21 09:45:06 2016 +0100

--
 .../org/apache/spark/mllib/feature/Word2Vec.scala  | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7654385f/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
index 42ca966..2364d43 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
@@ -35,6 +35,7 @@ import org.apache.spark.mllib.linalg.{Vector, Vectors}
 import org.apache.spark.mllib.util.{Loader, Saveable}
 import org.apache.spark.rdd._
 import org.apache.spark.sql.SparkSession
+import org.apache.spark.util.BoundedPriorityQueue
 import org.apache.spark.util.Utils
 import org.apache.spark.util.random.XORShiftRandom
 
@@ -555,7 +556,7 @@ class Word2VecModel private[spark] (
   num: Int,
   wordOpt: Option[String]): Array[(String, Double)] = {
 require(num > 0, "Number of similar words should > 0")
-// TODO: optimize top-k
+
 val fVector = vector.toArray.map(_.toFloat)
 val cosineVec = Array.fill[Float](numWords)(0)
 val alpha: Float = 1
@@ -580,10 +581,16 @@ class Word2VecModel private[spark] (
   ind += 1
 }
 
-val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
+val pq = new BoundedPriorityQueue[(String, Double)](num + 
1)(Ordering.by(_._2))
+
+for(i <- cosVec.indices) {
+  pq += Tuple2(wordList(i), cosVec(i))
+}
+
+val scored = pq.toSeq.sortBy(-_._2)
 
 val filtered = wordOpt match {
-  case Some(w) => scored.take(num + 1).filter(tup => w != tup._1)
+  case Some(w) => scored.filter(tup => w != tup._1)
   case None => scored
 }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate (FPR) test

2016-09-21 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 28fafa3ee -> b366f1849


[SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate 
(FPR) test

## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on 
univariate statistical tests. False Positive Rate (FPR) is a popular univariate 
statistical test for feature selection. This PR adds a chi-squared selector based on the False Positive Rate (FPR) test, similar to the implementation in scikit-learn:
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
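
A hedged usage sketch of the ML `ChiSqSelector` with the FPR option added here (the `training` DataFrame and column names are assumptions; per this PR's description, calling `setAlpha` switches the selector to FPR mode):

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// training is an assumed DataFrame with "features" (Vector) and "label" columns.
val selector = new ChiSqSelector()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")
  .setAlpha(0.05)   // keep features whose chi-squared p-value is below alpha

val selected = selector.fit(training).transform(training)
```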

## How was this patch tested?

Add Scala ut

Author: Peng, Meng 

Closes #14597 from mpjlu/fprChiSquare.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b366f184
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b366f184
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b366f184

Branch: refs/heads/master
Commit: b366f18496e1ce8bd20fe58a0245ef7d91819a03
Parents: 28fafa3
Author: Peng, Meng 
Authored: Wed Sep 21 10:17:38 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 21 10:17:38 2016 +0100

--
 .../apache/spark/ml/feature/ChiSqSelector.scala |  69 -
 .../spark/mllib/api/python/PythonMLLibAPI.scala |  28 -
 .../spark/mllib/feature/ChiSqSelector.scala | 103 ++-
 .../spark/ml/feature/ChiSqSelectorSuite.scala   |  11 +-
 .../mllib/feature/ChiSqSelectorSuite.scala  |  18 
 project/MimaExcludes.scala  |   3 +
 python/pyspark/mllib/feature.py |  71 -
 7 files changed, 262 insertions(+), 41 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b366f184/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
index 1482eb3..0c6a37b 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
@@ -27,6 +27,7 @@ import org.apache.spark.ml.param._
 import org.apache.spark.ml.param.shared._
 import org.apache.spark.ml.util._
 import org.apache.spark.mllib.feature
+import org.apache.spark.mllib.feature.ChiSqSelectorType
 import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
 import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
 import org.apache.spark.rdd.RDD
@@ -54,11 +55,47 @@ private[feature] trait ChiSqSelectorParams extends Params
 
   /** @group getParam */
   def getNumTopFeatures: Int = $(numTopFeatures)
+
+  final val percentile = new DoubleParam(this, "percentile",
+"Percentile of features that selector will select, ordered by statistics 
value descending.",
+ParamValidators.inRange(0, 1))
+  setDefault(percentile -> 0.1)
+
+  /** @group getParam */
+  def getPercentile: Double = $(percentile)
+
+  final val alpha = new DoubleParam(this, "alpha",
+"The highest p-value for features to be kept.",
+ParamValidators.inRange(0, 1))
+  setDefault(alpha -> 0.05)
+
+  /** @group getParam */
+  def getAlpha: Double = $(alpha)
+
+  /**
+   * The ChiSqSelector supports KBest, Percentile, FPR selection,
+   * which is the same as ChiSqSelectorType defined in MLLIB.
+   * when call setNumTopFeatures, the selectorType is set to KBest
+   * when call setPercentile, the selectorType is set to Percentile
+   * when call setAlpha, the selectorType is set to FPR
+   */
+  final val selectorType = new Param[String](this, "selectorType",
+"ChiSqSelector Type: KBest, Percentile, FPR")
+  setDefault(selectorType -> ChiSqSelectorType.KBest.toString)
+
+  /** @group getParam */
+  def getChiSqSelectorType: String = $(selectorType)
 }
 
 /**
  * Chi-Squared feature selection, which selects categorical features to use 
for predicting a
  * categorical label.
+ * The selector supports three selection methods: `KBest`, `Percentile` and 
`FPR`.
+ * `KBest` chooses the `k` top features according to a chi-squared test.
+ * `Percentile` is similar but chooses a fraction of all features instead of a 
fixed number.
+ * `FPR` chooses all features whose false positive rate meets some threshold.
+ * By default, the selection method is `KBest`, the default number of top 
features is 50.
+ * User can use setNumTopFeatures, setPercentile and setAlpha to set different 
selection methods.
  */
 @Since("1.6.0")
 final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: 
String)
@@ -69,7 +106,22 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") 
override val uid: Str
 
   /** @group setParam */
   @Sin

spark git commit: [SPARK-17219][ML] Add NaN value handling in Bucketizer

2016-09-21 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master b366f1849 -> 57dc326bd


[SPARK-17219][ML] Add NaN value handling in Bucketizer

## What changes were proposed in this pull request?
This PR fixes an issue when Bucketizer is called to handle a dataset containing
NaN values.
Sometimes NaN values are also meaningful to users, so in these cases Bucketizer
should reserve one extra bucket for NaN values instead of throwing an exception.
Before:
```
Bucketizer.transform on NaN value threw an illegal exception.
```
After:
```
NaN values will be grouped in an extra bucket.
```
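
As a usage illustration (not part of the patch), the sketch below assumes an
active SparkSession named `spark` and uses illustrative column names; with four
buckets defined by the splits, the NaN row lands in the extra bucket 4:

```scala
// Illustrative sketch: an active SparkSession `spark` is assumed.
import org.apache.spark.ml.feature.Bucketizer

val splits = Array(Double.NegativeInfinity, 0.0, 10.0, 20.0, Double.PositiveInfinity)
val df = spark.createDataFrame(
  Seq(-5.0, 3.0, 12.0, 25.0, Double.NaN).map(Tuple1.apply)).toDF("value")

val bucketizer = new Bucketizer()
  .setInputCol("value")
  .setOutputCol("bucket")
  .setSplits(splits)

// After this change, the NaN row is assigned the extra bucket index
// splits.length - 1 (here 4) instead of causing an exception.
bucketizer.transform(df).show()
```
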
## How was this patch tested?
New test cases added in `BucketizerSuite`.
Signed-off-by: VinceShieh 

Author: VinceShieh 

Closes #14858 from VinceShieh/spark-17219.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/57dc326b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/57dc326b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/57dc326b

Branch: refs/heads/master
Commit: 57dc326bd00cf0a49da971e9c573c48ae28acaa2
Parents: b366f18
Author: VinceShieh 
Authored: Wed Sep 21 10:20:57 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 21 10:20:57 2016 +0100

--
 docs/ml-features.md |  6 +++-
 .../apache/spark/ml/feature/Bucketizer.scala| 13 +---
 .../spark/ml/feature/QuantileDiscretizer.scala  |  9 --
 .../spark/ml/feature/BucketizerSuite.scala  | 31 
 .../ml/feature/QuantileDiscretizerSuite.scala   | 29 +++---
 python/pyspark/ml/feature.py|  5 
 .../spark/sql/DataFrameStatFunctions.scala  |  4 ++-
 7 files changed, 85 insertions(+), 12 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/57dc326b/docs/ml-features.md
--
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 746593f..a39b31c 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1102,7 +1102,11 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs a 
column with binned
-categorical features. The number of bins is set by the `numBuckets` parameter.
+categorical features. The number of bins is set by the `numBuckets` parameter. 
It is possible
+that the number of buckets used will be less than this value, for example, if 
there are too few
+distinct values of the input to create enough distinct quantiles. Note also 
that NaN values are
+handled specially and placed into their own bucket. For example, if 4 buckets 
are used, then
+non-NaN data will be put into buckets[0-3], but NaNs will be counted in a 
special bucket[4].
 The bin ranges are chosen using an approximate algorithm (see the 
documentation for
 
[approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions)
 for a
 detailed description). The precision of the approximation can be controlled 
with the

http://git-wip-us.apache.org/repos/asf/spark/blob/57dc326b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala
index 100d9e7..ec0ea05 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala
@@ -106,7 +106,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") 
override val uid: String
 @Since("1.6.0")
 object Bucketizer extends DefaultParamsReadable[Bucketizer] {
 
-  /** We require splits to be of length >= 3 and to be in strictly increasing 
order. */
+  /**
+   * We require splits to be of length >= 3 and to be in strictly increasing 
order.
+   * No NaN split should be accepted.
+   */
   private[feature] def checkSplits(splits: Array[Double]): Boolean = {
 if (splits.length < 3) {
   false
@@ -114,10 +117,10 @@ object Bucketizer extends 
DefaultParamsReadable[Bucketizer] {
   var i = 0
   val n = splits.length - 1
   while (i < n) {
-if (splits(i) >= splits(i + 1)) return false
+if (splits(i) >= splits(i + 1) || splits(i).isNaN) return false
 i += 1
   }
-  true
+  !splits(n).isNaN
 }
   }
 
@@ -126,7 +129,9 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] 
{
* @throws SparkException if a feature is < splits.head or > splits.last
*/
   private[feature] def binarySearchForBuckets(splits: Array[Double], feature: 
Double): Double = {
-if (feature == splits.last) {
+if (feature.isNaN) {
+  splits.length - 1
+} else if (feature == splits.last) {
   splits.length - 2

spark git commit: [SPARK-17583][SQL] Remove useless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV

2016-09-21 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 57dc326bd -> 25a020be9


[SPARK-17583][SQL] Remove useless rowSeparator variable and set auto-expanding 
buffer as default for maxCharsPerColumn option in CSV

## What changes were proposed in this pull request?

This PR includes the changes below:

1. Upgrade Univocity library from 2.1.1 to 2.2.1

  This includes some performance improvements and also enables an auto-expanding
buffer for the `maxCharsPerColumn` option in CSV. Please refer to the [release
notes](https://github.com/uniVocity/univocity-parsers/releases).

2. Remove the useless `rowSeparator` variable in `CSVOptions`

  We have this unused variable in 
[CSVOptions.scala#L127](https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127)
 but it can cause confusion, since it does not actually handle `\r\n`. There is
an open issue about this, 
[SPARK-17227](https://issues.apache.org/jira/browse/SPARK-17227), describing 
this variable.

  This variable is effectively unused, because we rely on Hadoop's
`LineRecordReader`, which handles both `\n` and `\r\n`.

3. Set the default value of `maxCharsPerColumn` to auto-expanding.

  We currently set a fixed maximum length for each column. It would be more
sensible to allow an auto-expanding buffer rather than a fixed length by default
(see the sketch after this list).

  For reference, using `-1` for auto-expanding is described in the release notes
for [2.2.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.2.0).
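
A minimal usage sketch (not from this patch; the path is a placeholder and an
active SparkSession named `spark` is assumed) of setting the option explicitly.
After this change, `-1` (auto-expanding) becomes the default, so the option only
needs to be set to enforce a fixed per-column limit:

```scala
// Placeholder path; SparkSession `spark` assumed. With the auto-expanding default,
// this option is only needed to re-impose a fixed per-column character limit.
val df = spark.read
  .option("header", "true")
  .option("maxCharsPerColumn", "10000")  // or "-1" for the auto-expanding default
  .csv("/path/to/data.csv")
```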

## How was this patch tested?

N/A

Author: hyukjinkwon 

Closes #15138 from HyukjinKwon/SPARK-17583.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/25a020be
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/25a020be
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/25a020be

Branch: refs/heads/master
Commit: 25a020be99b6a540e4001e59e40d5d1c8aa53812
Parents: 57dc326
Author: hyukjinkwon 
Authored: Wed Sep 21 10:35:29 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 21 10:35:29 2016 +0100

--
 dev/deps/spark-deps-hadoop-2.2   | 2 +-
 dev/deps/spark-deps-hadoop-2.3   | 2 +-
 dev/deps/spark-deps-hadoop-2.4   | 2 +-
 dev/deps/spark-deps-hadoop-2.6   | 2 +-
 dev/deps/spark-deps-hadoop-2.7   | 2 +-
 python/pyspark/sql/readwriter.py | 2 +-
 python/pyspark/sql/streaming.py  | 2 +-
 sql/core/pom.xml | 2 +-
 .../src/main/scala/org/apache/spark/sql/DataFrameReader.scala| 4 ++--
 .../apache/spark/sql/execution/datasources/csv/CSVOptions.scala  | 4 +---
 .../apache/spark/sql/execution/datasources/csv/CSVParser.scala   | 2 --
 .../scala/org/apache/spark/sql/streaming/DataStreamReader.scala  | 4 ++--
 12 files changed, 13 insertions(+), 17 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/25a020be/dev/deps/spark-deps-hadoop-2.2
--
diff --git a/dev/deps/spark-deps-hadoop-2.2 b/dev/deps/spark-deps-hadoop-2.2
index a7259e2..f4f92c6 100644
--- a/dev/deps/spark-deps-hadoop-2.2
+++ b/dev/deps/spark-deps-hadoop-2.2
@@ -159,7 +159,7 @@ stax-api-1.0.1.jar
 stream-2.7.0.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
-univocity-parsers-2.1.1.jar
+univocity-parsers-2.2.1.jar
 validation-api-1.1.0.Final.jar
 xbean-asm5-shaded-4.4.jar
 xmlenc-0.52.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/25a020be/dev/deps/spark-deps-hadoop-2.3
--
diff --git a/dev/deps/spark-deps-hadoop-2.3 b/dev/deps/spark-deps-hadoop-2.3
index 6986ab5..3db013f 100644
--- a/dev/deps/spark-deps-hadoop-2.3
+++ b/dev/deps/spark-deps-hadoop-2.3
@@ -167,7 +167,7 @@ stax-api-1.0.1.jar
 stream-2.7.0.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
-univocity-parsers-2.1.1.jar
+univocity-parsers-2.2.1.jar
 validation-api-1.1.0.Final.jar
 xbean-asm5-shaded-4.4.jar
 xmlenc-0.52.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/25a020be/dev/deps/spark-deps-hadoop-2.4
--
diff --git a/dev/deps/spark-deps-hadoop-2.4 b/dev/deps/spark-deps-hadoop-2.4
index 75cccb3..7171010 100644
--- a/dev/deps/spark-deps-hadoop-2.4
+++ b/dev/deps/spark-deps-hadoop-2.4
@@ -167,7 +167,7 @@ stax-api-1.0.1.jar
 stream-2.7.0.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
-univocity-parsers-2.1.1.jar
+univocity-parsers-2.2.1.jar
 validation-api-1.1.0.Final.jar
 xbean-asm5-shaded-4.4.jar
 xmlenc-0.52.jar

http

spark git commit: [CORE][MINOR] Add minor code change to TaskState and Task

2016-09-21 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 25a020be9 -> dd7561d33


[CORE][MINOR] Add minor code change to TaskState and Task

## What changes were proposed in this pull request?
- TaskState and ExecutorState expose isFailed and isFinished functions, so it is
useful to add test coverage for the different states. Other enums currently do
not expose any functions, so this PR targets just these two enums (see the
sketch after this list).
- A `private` access modifier is added to the finished task states set.
- A minor doc change is added.
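
A minimal sketch of the call pattern (note that `TaskState` is `private[spark]`,
so this only compiles inside the `org.apache.spark` package):

```scala
// Callers check terminal states via the helper functions rather than touching
// FINISHED_STATES directly, which is now private.
import org.apache.spark.TaskState

assert(TaskState.isFinished(TaskState.KILLED))   // KILLED is a terminal state
assert(TaskState.isFailed(TaskState.LOST))       // LOST also counts as failed
assert(!TaskState.isFinished(TaskState.RUNNING)) // RUNNING is not terminal
```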

## How was this patch tested?
New Unit tests are added and run locally.

Author: erenavsarogullari 

Closes #15143 from erenavsarogullari/SPARK-17584.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dd7561d3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dd7561d3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dd7561d3

Branch: refs/heads/master
Commit: dd7561d33761d119ded09cfba072147292bf6964
Parents: 25a020b
Author: erenavsarogullari 
Authored: Wed Sep 21 14:47:18 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 21 14:47:18 2016 +0100

--
 core/src/main/scala/org/apache/spark/TaskState.scala  | 2 +-
 core/src/main/scala/org/apache/spark/scheduler/Task.scala | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/dd7561d3/core/src/main/scala/org/apache/spark/TaskState.scala
--
diff --git a/core/src/main/scala/org/apache/spark/TaskState.scala 
b/core/src/main/scala/org/apache/spark/TaskState.scala
index cbace7b..596ce67 100644
--- a/core/src/main/scala/org/apache/spark/TaskState.scala
+++ b/core/src/main/scala/org/apache/spark/TaskState.scala
@@ -21,7 +21,7 @@ private[spark] object TaskState extends Enumeration {
 
   val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value
 
-  val FINISHED_STATES = Set(FINISHED, FAILED, KILLED, LOST)
+  private val FINISHED_STATES = Set(FINISHED, FAILED, KILLED, LOST)
 
   type TaskState = Value
 

http://git-wip-us.apache.org/repos/asf/spark/blob/dd7561d3/core/src/main/scala/org/apache/spark/scheduler/Task.scala
--
diff --git a/core/src/main/scala/org/apache/spark/scheduler/Task.scala 
b/core/src/main/scala/org/apache/spark/scheduler/Task.scala
index 1ed36bf..ea9dc39 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/Task.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/Task.scala
@@ -239,7 +239,7 @@ private[spark] object Task {
* and return the task itself as a serialized ByteBuffer. The caller can 
then update its
* ClassLoaders and deserialize the task.
*
-   * @return (taskFiles, taskJars, taskBytes)
+   * @return (taskFiles, taskJars, taskProps, taskBytes)
*/
   def deserializeWithDependencies(serializedTask: ByteBuffer)
 : (HashMap[String, Long], HashMap[String, Long], Properties, ByteBuffer) = 
{


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [BACKPORT 2.0][MINOR][BUILD] Fix CheckStyle Error

2016-09-21 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 65295bad9 -> 45bccdd9c


[BACKPORT 2.0][MINOR][BUILD] Fix CheckStyle Error

## What changes were proposed in this pull request?
This PR is to fix the code style errors.

## How was this patch tested?
Manual.

Before:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] 
src/main/java/org/apache/spark/network/client/TransportClient.java:[153] 
(sizes) LineLength: Line is longer than 100 characters (found 107).
[ERROR] 
src/main/java/org/apache/spark/network/client/TransportClient.java:[196] 
(sizes) LineLength: Line is longer than 100 characters (found 108).
[ERROR] 
src/main/java/org/apache/spark/network/client/TransportClient.java:[239] 
(sizes) LineLength: Line is longer than 100 characters (found 115).
[ERROR] 
src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[119]
 (sizes) LineLength: Line is longer than 100 characters (found 107).
[ERROR] 
src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[129]
 (sizes) LineLength: Line is longer than 100 characters (found 104).
[ERROR] 
src/main/java/org/apache/spark/network/util/LevelDBProvider.java:[124,11] 
(modifier) ModifierOrder: 'static' modifier out of order with the JLS 
suggestions.
[ERROR] 
src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[184] 
(regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] 
src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[304] 
(regexp) RegexpSingleline: No trailing whitespace allowed.
 ```
After:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

Author: Weiqing Yang 

Closes #15175 from Sherry302/javastylefix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/45bccdd9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/45bccdd9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/45bccdd9

Branch: refs/heads/branch-2.0
Commit: 45bccdd9c2b180323958db0f92ca8ee591e502ef
Parents: 65295ba
Author: Weiqing Yang 
Authored: Wed Sep 21 15:18:02 2016 +0100
Committer: Sean Owen 
Committed: Wed Sep 21 15:18:02 2016 +0100

--
 .../org/apache/spark/network/client/TransportClient.java | 11 ++-
 .../spark/network/server/TransportRequestHandler.java|  7 ---
 .../org/apache/spark/network/util/LevelDBProvider.java   |  2 +-
 .../apache/spark/network/yarn/YarnShuffleService.java|  4 ++--
 4 files changed, 13 insertions(+), 11 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/45bccdd9/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
--
diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
 
b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
index a67683b..17ac91d 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
@@ -150,8 +150,8 @@ public class TransportClient implements Closeable {
   if (future.isSuccess()) {
 long timeTaken = System.currentTimeMillis() - startTime;
 if (logger.isTraceEnabled()) {
-  logger.trace("Sending request {} to {} took {} ms", 
streamChunkId, getRemoteAddress(channel),
-timeTaken);
+  logger.trace("Sending request {} to {} took {} ms", 
streamChunkId,
+getRemoteAddress(channel), timeTaken);
 }
   } else {
 String errorMsg = String.format("Failed to send request %s to %s: 
%s", streamChunkId,
@@ -193,8 +193,8 @@ public class TransportClient implements Closeable {
 if (future.isSuccess()) {
   long timeTaken = System.currentTimeMillis() - startTime;
   if (logger.isTraceEnabled()) {
-logger.trace("Sending request for {} to {} took {} ms", 
streamId, getRemoteAddress(channel),
-  timeTaken);
+logger.trace("Sending request for {} to {} took {} ms", 
streamId,
+  getRemoteAddress(channel), timeTaken);
   }
 } else {
   String errorMsg = String.format("Failed to send request for %s 
to %s: %s", streamId,
@@ -236,7 +236,8 @@ public class TransportClient implements Closeable {
   if (future.isSuccess()) {
 long timeTaken = System.currentTimeMillis() - startTime;
 if (logger.isTraceEnabled()) {
-  logger.trace("Sending request {} to {} took {} ms", requestId, 
getRemoteAddress(channel

spark git commit: [SPARK-17421][DOCS] Documenting the current treatment of MAVEN_OPTS.

2016-09-22 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 e8b26be9b -> b25a8e6e1


[SPARK-17421][DOCS] Documenting the current treatment of MAVEN_OPTS.

## What changes were proposed in this pull request?

Modified the documentation to clarify that `build/mvn` and `pom.xml` always add 
Java 7-specific parameters to `MAVEN_OPTS`, and that developers can safely 
ignore warnings about `-XX:MaxPermSize` that may result from compiling or 
running tests with Java 8.

## How was this patch tested?

Rebuilt HTML documentation, made sure that building-spark.html displays 
correctly in a browser.

Author: frreiss 

Closes #15005 from frreiss/fred-17421a.

(cherry picked from commit 646f383465c123062cbcce288a127e23984c7c7f)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b25a8e6e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b25a8e6e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b25a8e6e

Branch: refs/heads/branch-2.0
Commit: b25a8e6e167717fbe92e6a9b69a8a2510bf926ca
Parents: e8b26be
Author: frreiss 
Authored: Thu Sep 22 10:31:15 2016 +0100
Committer: Sean Owen 
Committed: Thu Sep 22 10:31:28 2016 +0100

--
 docs/building-spark.md | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b25a8e6e/docs/building-spark.md
--
diff --git a/docs/building-spark.md b/docs/building-spark.md
index 2c987cf..330df00 100644
--- a/docs/building-spark.md
+++ b/docs/building-spark.md
@@ -16,11 +16,13 @@ Building Spark using Maven requires Maven 3.3.9 or newer 
and Java 7+.
 
 ### Setting up Maven's Memory Usage
 
-You'll need to configure Maven to use more memory than usual by setting 
`MAVEN_OPTS`. We recommend the following settings:
+You'll need to configure Maven to use more memory than usual by setting 
`MAVEN_OPTS`:
 
-export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M 
-XX:ReservedCodeCacheSize=512m"
+export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
 
-If you don't run this, you may see errors like the following:
+When compiling with Java 7, you will need to add the additional option 
"-XX:MaxPermSize=512M" to MAVEN_OPTS.
+
+If you don't add these parameters to `MAVEN_OPTS`, you may see errors and 
warnings like the following:
 
 [INFO] Compiling 203 Scala sources and 9 Java sources to 
/Users/me/Development/spark/core/target/scala-{{site.SCALA_BINARY_VERSION}}/classes...
 [ERROR] PermGen space -> [Help 1]
@@ -28,12 +30,18 @@ If you don't run this, you may see errors like the 
following:
 [INFO] Compiling 203 Scala sources and 9 Java sources to 
/Users/me/Development/spark/core/target/scala-{{site.SCALA_BINARY_VERSION}}/classes...
 [ERROR] Java heap space -> [Help 1]
 
-You can fix this by setting the `MAVEN_OPTS` variable as discussed before.
+[INFO] Compiling 233 Scala sources and 41 Java sources to 
/Users/me/Development/spark/sql/core/target/scala-{site.SCALA_BINARY_VERSION}/classes...
+OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been 
disabled.
+OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
-XX:ReservedCodeCacheSize=
+
+You can fix these problems by setting the `MAVEN_OPTS` variable as discussed 
before.
 
 **Note:**
 
-* For Java 8 and above this step is not required.
-* If using `build/mvn` with no `MAVEN_OPTS` set, the script will automate this 
for you.
+* If using `build/mvn` with no `MAVEN_OPTS` set, the script will automatically 
add the above options to the `MAVEN_OPTS` environment variable.
+* The `test` phase of the Spark build will automatically add these options to 
`MAVEN_OPTS`, even when not using `build/mvn`.
+* You may see warnings like "ignoring option MaxPermSize=1g; support was 
removed in 8.0" when building or running tests with Java 8 and `build/mvn`. 
These warnings are harmless.
+
 
 ### build/mvn
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17421][DOCS] Documenting the current treatment of MAVEN_OPTS.

2016-09-22 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master de7df7def -> 646f38346


[SPARK-17421][DOCS] Documenting the current treatment of MAVEN_OPTS.

## What changes were proposed in this pull request?

Modified the documentation to clarify that `build/mvn` and `pom.xml` always add 
Java 7-specific parameters to `MAVEN_OPTS`, and that developers can safely 
ignore warnings about `-XX:MaxPermSize` that may result from compiling or 
running tests with Java 8.

## How was this patch tested?

Rebuilt HTML documentation, made sure that building-spark.html displays 
correctly in a browser.

Author: frreiss 

Closes #15005 from frreiss/fred-17421a.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/646f3834
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/646f3834
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/646f3834

Branch: refs/heads/master
Commit: 646f383465c123062cbcce288a127e23984c7c7f
Parents: de7df7d
Author: frreiss 
Authored: Thu Sep 22 10:31:15 2016 +0100
Committer: Sean Owen 
Committed: Thu Sep 22 10:31:15 2016 +0100

--
 docs/building-spark.md | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/646f3834/docs/building-spark.md
--
diff --git a/docs/building-spark.md b/docs/building-spark.md
index 6908fc1..75c304a3 100644
--- a/docs/building-spark.md
+++ b/docs/building-spark.md
@@ -16,11 +16,13 @@ Building Spark using Maven requires Maven 3.3.9 or newer 
and Java 7+.
 
 ### Setting up Maven's Memory Usage
 
-You'll need to configure Maven to use more memory than usual by setting 
`MAVEN_OPTS`. We recommend the following settings:
+You'll need to configure Maven to use more memory than usual by setting 
`MAVEN_OPTS`:
 
-export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M 
-XX:ReservedCodeCacheSize=512m"
+export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
 
-If you don't run this, you may see errors like the following:
+When compiling with Java 7, you will need to add the additional option 
"-XX:MaxPermSize=512M" to MAVEN_OPTS.
+
+If you don't add these parameters to `MAVEN_OPTS`, you may see errors and 
warnings like the following:
 
 [INFO] Compiling 203 Scala sources and 9 Java sources to 
/Users/me/Development/spark/core/target/scala-{{site.SCALA_BINARY_VERSION}}/classes...
 [ERROR] PermGen space -> [Help 1]
@@ -28,12 +30,18 @@ If you don't run this, you may see errors like the 
following:
 [INFO] Compiling 203 Scala sources and 9 Java sources to 
/Users/me/Development/spark/core/target/scala-{{site.SCALA_BINARY_VERSION}}/classes...
 [ERROR] Java heap space -> [Help 1]
 
-You can fix this by setting the `MAVEN_OPTS` variable as discussed before.
+[INFO] Compiling 233 Scala sources and 41 Java sources to 
/Users/me/Development/spark/sql/core/target/scala-{site.SCALA_BINARY_VERSION}/classes...
+OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been 
disabled.
+OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
-XX:ReservedCodeCacheSize=
+
+You can fix these problems by setting the `MAVEN_OPTS` variable as discussed 
before.
 
 **Note:**
 
-* For Java 8 and above this step is not required.
-* If using `build/mvn` with no `MAVEN_OPTS` set, the script will automate this 
for you.
+* If using `build/mvn` with no `MAVEN_OPTS` set, the script will automatically 
add the above options to the `MAVEN_OPTS` environment variable.
+* The `test` phase of the Spark build will automatically add these options to 
`MAVEN_OPTS`, even when not using `build/mvn`.
+* You may see warnings like "ignoring option MaxPermSize=1g; support was 
removed in 8.0" when building or running tests with Java 8 and `build/mvn`. 
These warnings are harmless.
+
 
 ### build/mvn
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [BUILD] Closes some stale PRs

2016-09-23 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 62ccf27ab -> 5c5396cb4


[BUILD] Closes some stale PRs

## What changes were proposed in this pull request?

This PR proposes to close some stale PRs and ones suggested to be closed by 
committer(s)

Closes #12415
Closes #14765
Closes #15118
Closes #15184
Closes #15183
Closes #9440
Closes #15023
Closes #14643
Closes #14827

## How was this patch tested?

N/A

Author: hyukjinkwon 

Closes #15198 from HyukjinKwon/stale-prs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5c5396cb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5c5396cb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5c5396cb

Branch: refs/heads/master
Commit: 5c5396cb4725ba5ceee26ed885e8b941d219757b
Parents: 62ccf27
Author: hyukjinkwon 
Authored: Fri Sep 23 09:41:50 2016 +0100
Committer: Sean Owen 
Committed: Fri Sep 23 09:41:50 2016 +0100

--

--



-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-16861][PYSPARK][CORE] Refactor PySpark accumulator API on top of Accumulator V2

2016-09-23 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 5c5396cb4 -> 90d575421


[SPARK-16861][PYSPARK][CORE] Refactor PySpark accumulator API on top of 
Accumulator V2

## What changes were proposed in this pull request?

Move the internals of the PySpark accumulator API off the old deprecated
Accumulator API and onto the new AccumulatorV2 API.
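
As background, a minimal sketch (illustrative names; an active SparkContext `sc`
is assumed) of the AccumulatorV2-style usage the internals move onto; the
PythonAccumulatorV2 in the diff below builds on the same CollectionAccumulator
base and passes the collected updates to Python through a socket:

```scala
// Illustrative sketch of the AccumulatorV2-style usage replacing the old
// Accumulator/AccumulatorParam `+=` pattern; assumes an active SparkContext `sc`.
import org.apache.spark.util.CollectionAccumulator

val acc = new CollectionAccumulator[Array[Byte]]
sc.register(acc, "pickled updates")          // register with the SparkContext
acc.add("pickled-update".getBytes("UTF-8"))  // mirrors `accumulator.add(update)` in the diff
println(acc.value.size())                    // value is a java.util.List[Array[Byte]]
```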

## How was this patch tested?

The existing PySpark accumulator tests (both unit tests and doc tests at the 
start of accumulator.py).

Author: Holden Karau 

Closes #14467 from holdenk/SPARK-16861-refactor-pyspark-accumulator-api.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/90d57542
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/90d57542
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/90d57542

Branch: refs/heads/master
Commit: 90d5754212425d55f992c939a2bc7d9ac6ef92b8
Parents: 5c5396c
Author: Holden Karau 
Authored: Fri Sep 23 09:44:30 2016 +0100
Committer: Sean Owen 
Committed: Fri Sep 23 09:44:30 2016 +0100

--
 .../org/apache/spark/api/python/PythonRDD.scala | 42 +++-
 python/pyspark/context.py   |  5 +--
 2 files changed, 25 insertions(+), 22 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/90d57542/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
--
diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
index d841091..0ca91b9 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
@@ -20,7 +20,7 @@ package org.apache.spark.api.python
 import java.io._
 import java.net._
 import java.nio.charset.StandardCharsets
-import java.util.{ArrayList => JArrayList, Collections, List => JList, Map => 
JMap}
+import java.util.{ArrayList => JArrayList, List => JList, Map => JMap}
 
 import scala.collection.JavaConverters._
 import scala.collection.mutable
@@ -38,7 +38,7 @@ import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.input.PortableDataStream
 import org.apache.spark.internal.Logging
 import org.apache.spark.rdd.RDD
-import org.apache.spark.util.{SerializableConfiguration, Utils}
+import org.apache.spark.util._
 
 
 private[spark] class PythonRDD(
@@ -75,7 +75,7 @@ private[spark] case class PythonFunction(
 pythonExec: String,
 pythonVer: String,
 broadcastVars: JList[Broadcast[PythonBroadcast]],
-accumulator: Accumulator[JList[Array[Byte]]])
+accumulator: PythonAccumulatorV2)
 
 /**
  * A wrapper for chained Python functions (from bottom to top).
@@ -200,7 +200,7 @@ private[spark] class PythonRunner(
 val updateLen = stream.readInt()
 val update = new Array[Byte](updateLen)
 stream.readFully(update)
-accumulator += Collections.singletonList(update)
+accumulator.add(update)
   }
   // Check whether the worker is ready to be re-used.
   if (stream.readInt() == SpecialLengths.END_OF_STREAM) {
@@ -461,7 +461,7 @@ private[spark] object PythonRDD extends Logging {
   JavaRDD[Array[Byte]] = {
 val file = new DataInputStream(new FileInputStream(filename))
 try {
-  val objs = new collection.mutable.ArrayBuffer[Array[Byte]]
+  val objs = new mutable.ArrayBuffer[Array[Byte]]
   try {
 while (true) {
   val length = file.readInt()
@@ -866,11 +866,13 @@ class BytesToString extends 
org.apache.spark.api.java.function.Function[Array[By
 }
 
 /**
- * Internal class that acts as an `AccumulatorParam` for Python accumulators. 
Inside, it
+ * Internal class that acts as an `AccumulatorV2` for Python accumulators. 
Inside, it
  * collects a list of pickled strings that we pass to Python through a socket.
  */
-private class PythonAccumulatorParam(@transient private val serverHost: 
String, serverPort: Int)
-  extends AccumulatorParam[JList[Array[Byte]]] {
+private[spark] class PythonAccumulatorV2(
+@transient private val serverHost: String,
+private val serverPort: Int)
+  extends CollectionAccumulator[Array[Byte]] {
 
   Utils.checkHost(serverHost, "Expected hostname")
 
@@ -880,30 +882,33 @@ private class PythonAccumulatorParam(@transient private 
val serverHost: String,
* We try to reuse a single Socket to transfer accumulator updates, as they 
are all added
* by the DAGScheduler's single-threaded RpcEndpoint anyway.
*/
-  @transient var socket: Socket = _
+  @transient private var socket: Socket = _
 
-  def openSocket(): Socket = synchronized {
+  private def openSocket(): Socket = synchronized {
 if (socket == null || socket.isClosed) {
 

spark git commit: [SPARK-10835][ML] Word2Vec should accept non-null string array, in addition to existing null string array

2016-09-24 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 7c382524a -> f3fe55439


[SPARK-10835][ML] Word2Vec should accept non-null string array, in addition to 
existing null string array

## What changes were proposed in this pull request?

To match Tokenizer and for compatibility with Word2Vec, output a nullable 
string array type in NGram

## How was this patch tested?

Jenkins tests.

Author: Sean Owen 

Closes #15179 from srowen/SPARK-10835.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f3fe5543
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f3fe5543
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f3fe5543

Branch: refs/heads/master
Commit: f3fe55439e4c865c26502487a1bccf255da33f4a
Parents: 7c38252
Author: Sean Owen 
Authored: Sat Sep 24 08:06:41 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 24 08:06:41 2016 +0100

--
 .../org/apache/spark/ml/feature/Word2Vec.scala  |  3 ++-
 .../apache/spark/ml/feature/Word2VecSuite.scala | 21 
 2 files changed, 23 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f3fe5543/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
index 14c0512..d53f3df 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
@@ -108,7 +108,8 @@ private[feature] trait Word2VecBase extends Params
* Validate and transform the input schema.
*/
   protected def validateAndTransformSchema(schema: StructType): StructType = {
-SchemaUtils.checkColumnType(schema, $(inputCol), new ArrayType(StringType, 
true))
+val typeCandidates = List(new ArrayType(StringType, true), new 
ArrayType(StringType, false))
+SchemaUtils.checkColumnTypes(schema, $(inputCol), typeCandidates)
 SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
   }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/f3fe5543/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala
index 0b441f8..613cc3d 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala
@@ -207,5 +207,26 @@ class Word2VecSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defaul
 val newInstance = testDefaultReadWrite(instance)
 assert(newInstance.getVectors.collect() === instance.getVectors.collect())
   }
+
+  test("Word2Vec works with input that is non-nullable (NGram)") {
+val spark = this.spark
+import spark.implicits._
+
+val sentence = "a q s t q s t b b b s t m s t m q "
+val docDF = sc.parallelize(Seq(sentence, sentence)).map(_.split(" 
")).toDF("text")
+
+val ngram = new NGram().setN(2).setInputCol("text").setOutputCol("ngrams")
+val ngramDF = ngram.transform(docDF)
+
+val model = new Word2Vec()
+  .setVectorSize(2)
+  .setInputCol("ngrams")
+  .setOutputCol("result")
+  .fit(ngramDF)
+
+// Just test that this transformation succeeds
+model.transform(ngramDF).collect()
+  }
+
 }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10835][ML] Word2Vec should accept non-null string array, in addition to existing null string array

2016-09-24 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 9e91a1009 -> ed545763a


[SPARK-10835][ML] Word2Vec should accept non-null string array, in addition to 
existing null string array

## What changes were proposed in this pull request?

To match Tokenizer and for compatibility with Word2Vec, output a nullable 
string array type in NGram

## How was this patch tested?

Jenkins tests.

Author: Sean Owen 

Closes #15179 from srowen/SPARK-10835.

(cherry picked from commit f3fe55439e4c865c26502487a1bccf255da33f4a)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ed545763
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ed545763
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ed545763

Branch: refs/heads/branch-2.0
Commit: ed545763adc3f50569581c9b017b396e8997ac31
Parents: 9e91a10
Author: Sean Owen 
Authored: Sat Sep 24 08:06:41 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 24 08:06:56 2016 +0100

--
 .../org/apache/spark/ml/feature/Word2Vec.scala  |  3 ++-
 .../apache/spark/ml/feature/Word2VecSuite.scala | 21 
 2 files changed, 23 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ed545763/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
index 14c0512..d53f3df 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
@@ -108,7 +108,8 @@ private[feature] trait Word2VecBase extends Params
* Validate and transform the input schema.
*/
   protected def validateAndTransformSchema(schema: StructType): StructType = {
-SchemaUtils.checkColumnType(schema, $(inputCol), new ArrayType(StringType, 
true))
+val typeCandidates = List(new ArrayType(StringType, true), new 
ArrayType(StringType, false))
+SchemaUtils.checkColumnTypes(schema, $(inputCol), typeCandidates)
 SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
   }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/ed545763/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala
index 16c74f6..c8f1311 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala
@@ -207,5 +207,26 @@ class Word2VecSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defaul
 val newInstance = testDefaultReadWrite(instance)
 assert(newInstance.getVectors.collect() === instance.getVectors.collect())
   }
+
+  test("Word2Vec works with input that is non-nullable (NGram)") {
+val spark = this.spark
+import spark.implicits._
+
+val sentence = "a q s t q s t b b b s t m s t m q "
+val docDF = sc.parallelize(Seq(sentence, sentence)).map(_.split(" 
")).toDF("text")
+
+val ngram = new NGram().setN(2).setInputCol("text").setOutputCol("ngrams")
+val ngramDF = ngram.transform(docDF)
+
+val model = new Word2Vec()
+  .setVectorSize(2)
+  .setInputCol("ngrams")
+  .setOutputCol("result")
+  .fit(ngramDF)
+
+// Just test that this transformation succeeds
+model.transform(ngramDF).collect()
+  }
+
 }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17057][ML] ProbabilisticClassifierModels' thresholds should have at most one 0

2016-09-24 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master f3fe55439 -> 248916f55


[SPARK-17057][ML] ProbabilisticClassifierModels' thresholds should have at most 
one 0

## What changes were proposed in this pull request?

Match ProbabilisticClassifier.thresholds requirements to R randomForest cutoff,
requiring all > 0.
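
For illustration, a standalone sketch (made-up values) of the prediction rule the
constraint protects: the class with the largest probability(i) / thresholds(i) is
predicted, so thresholds must all be > 0, with at most one allowed to be exactly 0:

```scala
// Made-up values; the predicted class maximises probability(i) / thresholds(i).
val probability = Array(0.2, 0.5, 0.3)
val thresholds  = Array(0.4, 0.6, 0.2)

val prediction = probability.indices.maxBy(i => probability(i) / thresholds(i))
// scaled values: 0.5, 0.833..., 1.5 -> class 2 is predicted
println(prediction)
```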

## How was this patch tested?

Jenkins tests plus new test cases

Author: Sean Owen 

Closes #15149 from srowen/SPARK-17057.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/248916f5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/248916f5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/248916f5

Branch: refs/heads/master
Commit: 248916f5589155c0c3e93c3874781f17b08d598d
Parents: f3fe554
Author: Sean Owen 
Authored: Sat Sep 24 08:15:55 2016 +0100
Committer: Sean Owen 
Committed: Sat Sep 24 08:15:55 2016 +0100

--
 .../ml/classification/LogisticRegression.scala  |  5 +--
 .../ProbabilisticClassifier.scala   | 20 +--
 .../ml/param/shared/SharedParamsCodeGen.scala   |  8 +++--
 .../spark/ml/param/shared/sharedParams.scala|  4 +--
 .../ProbabilisticClassifierSuite.scala  | 35 
 .../pyspark/ml/param/_shared_params_code_gen.py |  5 +--
 python/pyspark/ml/param/shared.py   |  4 +--
 7 files changed, 52 insertions(+), 29 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/248916f5/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index 343d50c..5ab63d1 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -123,9 +123,10 @@ private[classification] trait LogisticRegressionParams 
extends ProbabilisticClas
 
   /**
* Set thresholds in multiclass (or binary) classification to adjust the 
probability of
-   * predicting each class. Array must have length equal to the number of 
classes, with values >= 0.
+   * predicting each class. Array must have length equal to the number of 
classes, with values > 0,
+   * excepting that at most one value may be 0.
* The class with largest value p/t is predicted, where p is the original 
probability of that
-   * class and t is the class' threshold.
+   * class and t is the class's threshold.
*
* Note: When [[setThresholds()]] is called, any user-set value for 
[[threshold]] will be cleared.
*   If both [[threshold]] and [[thresholds]] are set in a ParamMap, 
then they must be

http://git-wip-us.apache.org/repos/asf/spark/blob/248916f5/mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala
index 1b6e775..e89da6f 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala
@@ -18,7 +18,7 @@
 package org.apache.spark.ml.classification
 
 import org.apache.spark.annotation.DeveloperApi
-import org.apache.spark.ml.linalg.{DenseVector, Vector, Vectors, VectorUDT}
+import org.apache.spark.ml.linalg.{DenseVector, Vector, VectorUDT}
 import org.apache.spark.ml.param.shared._
 import org.apache.spark.ml.util.SchemaUtils
 import org.apache.spark.sql.{DataFrame, Dataset}
@@ -200,22 +200,20 @@ abstract class ProbabilisticClassificationModel[
 if (!isDefined(thresholds)) {
   probability.argmax
 } else {
-  val thresholds: Array[Double] = getThresholds
-  val probabilities = probability.toArray
+  val thresholds = getThresholds
   var argMax = 0
   var max = Double.NegativeInfinity
   var i = 0
   val probabilitySize = probability.size
   while (i < probabilitySize) {
-if (thresholds(i) == 0.0) {
-  max = Double.PositiveInfinity
+// Thresholds are all > 0, excepting that at most one may be 0.
+// The single class whose threshold is 0, if any, will always be 
predicted
+// ('scaled' = +Infinity). However in the case that this class also has
+// 0 probability, the class will not be selected ('scaled' is NaN).
+val scaled = probability(i) / thresholds(i)
+if (scaled &

spark git commit: [SPARK-17017][FOLLOW-UP][ML] Refactor of ChiSqSelector and add ML Python API.

2016-09-26 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 59d87d240 -> ac65139be


[SPARK-17017][FOLLOW-UP][ML] Refactor of ChiSqSelector and add ML Python API.

## What changes were proposed in this pull request?
#14597 modified ```ChiSqSelector``` to support the ```fpr``` selector type;
however, it left some issues that need to be addressed:
* We should allow users to set the selector type explicitly rather than
switching it implicitly via different setter functions, since the setting order
can lead to unexpected results. For example, if users set both
```numTopFeatures``` and ```percentile```, a ```kbest``` or ```percentile```
model is trained depending on the order of the calls (the last one set wins).
This confuses users, so we should allow the selector type to be set explicitly.
We handle similar issues elsewhere in the ML code base, such as
```GeneralizedLinearRegression``` and ```LogisticRegression```.
* Moreover, if more than one parameter besides ```alpha``` could be set for the
```fpr``` model, we could not handle it elegantly in the existing framework;
similar issues apply to the ```kbest``` and ```percentile``` models. Setting the
selector type explicitly also solves this.
* If users are allowed to set the selector type explicitly, we should handle
parameter interaction: for example, if a user sets ```selectorType = percentile```
and ```alpha = 0.1```, we should notify them that ```alpha``` will have no
effect. Complex parameter interaction checks should be handled in
```transformSchema```. (FYI #11620)
* We should use lower-case selector type names to follow the MLlib convention
(see the sketch after this list).
* Add an ML Python API.
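
For illustration, a minimal sketch (placeholder column names) of the explicit
selector-type API described above:

```scala
// Placeholder column names; the selection method is chosen explicitly via
// setSelectorType rather than implied by whichever setter was called last.
import org.apache.spark.ml.feature.ChiSqSelector

val selector = new ChiSqSelector()
  .setSelectorType("percentile")   // one of "kbest", "percentile", "fpr"
  .setPercentile(0.1)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")
// selector.fit(df).transform(df) would then keep the top 10% of features.
```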

## How was this patch tested?
Unit test.

Author: Yanbo Liang 

Closes #15214 from yanboliang/spark-17017.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ac65139b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ac65139b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ac65139b

Branch: refs/heads/master
Commit: ac65139be96dbf87402b9a85729a93afd3c6ff17
Parents: 59d87d2
Author: Yanbo Liang 
Authored: Mon Sep 26 09:45:33 2016 +0100
Committer: Sean Owen 
Committed: Mon Sep 26 09:45:33 2016 +0100

--
 .../apache/spark/ml/feature/ChiSqSelector.scala | 86 +++-
 .../spark/mllib/api/python/PythonMLLibAPI.scala | 38 +++--
 .../spark/mllib/feature/ChiSqSelector.scala | 51 +++-
 .../spark/ml/feature/ChiSqSelectorSuite.scala   | 27 --
 .../mllib/feature/ChiSqSelectorSuite.scala  |  2 +-
 python/pyspark/ml/feature.py| 71 ++--
 python/pyspark/mllib/feature.py | 59 +++---
 7 files changed, 206 insertions(+), 128 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ac65139b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
index 0c6a37b..9c131a4 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
@@ -27,7 +27,7 @@ import org.apache.spark.ml.param._
 import org.apache.spark.ml.param.shared._
 import org.apache.spark.ml.util._
 import org.apache.spark.mllib.feature
-import org.apache.spark.mllib.feature.ChiSqSelectorType
+import org.apache.spark.mllib.feature.{ChiSqSelector => OldChiSqSelector}
 import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
 import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
 import org.apache.spark.rdd.RDD
@@ -44,7 +44,9 @@ private[feature] trait ChiSqSelectorParams extends Params
   /**
* Number of features that selector will select (ordered by statistic value 
descending). If the
* number of features is less than numTopFeatures, then this will select all 
features.
+   * Only applicable when selectorType = "kbest".
* The default value of numTopFeatures is 50.
+   *
* @group param
*/
   final val numTopFeatures = new IntParam(this, "numTopFeatures",
@@ -56,6 +58,11 @@ private[feature] trait ChiSqSelectorParams extends Params
   /** @group getParam */
   def getNumTopFeatures: Int = $(numTopFeatures)
 
+  /**
+   * Percentile of features that selector will select, ordered by statistics 
value descending.
+   * Only applicable when selectorType = "percentile".
+   * Default value is 0.1.
+   */
   final val percentile = new DoubleParam(this, "percentile",
 "Percentile of features that selector will select, ordered by statistics 
value descending.",
 ParamValidators.inRange(0, 1))
@@ -64,8 +71,12 @@ private[feature] trait ChiSqSelectorParams 

spark git commit: [SPARK-14525][SQL] Make DataFrameWrite.save work for jdbc

2016-09-26 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master ac65139be -> 50b89d05b


[SPARK-14525][SQL] Make DataFrameWrite.save work for jdbc

## What changes were proposed in this pull request?

This change modifies the implementation of DataFrameWriter.save such that it 
works with jdbc, and the call to jdbc merely delegates to save.
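
In Scala terms (a sketch with placeholder connection details, assuming an
existing DataFrame `df`; the Java and Python equivalents appear in the example
diffs below), the newly supported path looks like:

```scala
// Placeholder connection details; the generic format("jdbc") ... save() path now
// works, and the dedicated jdbc() method simply delegates to save().
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .save()
```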

## How was this patch tested?

This was tested via unit tests in the JDBCWriteSuite, to which I added one new
test to cover this scenario.

## Additional details

rxin This seems to have been most recently touched by you and was also 
commented on in the JIRA.

This contribution is my original work and I license the work to the project 
under the project's open source license.

Author: Justin Pihony 
Author: Justin Pihony 

Closes #12601 from JustinPihony/jdbc_reconciliation.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/50b89d05
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/50b89d05
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/50b89d05

Branch: refs/heads/master
Commit: 50b89d05b7bffc212cc9b9ae6e0bca7cb90b9c77
Parents: ac65139
Author: Justin Pihony 
Authored: Mon Sep 26 09:54:22 2016 +0100
Committer: Sean Owen 
Committed: Mon Sep 26 09:54:22 2016 +0100

--
 docs/sql-programming-guide.md   |  6 +-
 .../examples/sql/JavaSQLDataSourceExample.java  | 21 +
 examples/src/main/python/sql/datasource.py  | 19 
 examples/src/main/r/RSparkSQLExample.R  |  4 +
 .../examples/sql/SQLDataSourceExample.scala | 22 +
 .../org/apache/spark/sql/DataFrameWriter.scala  | 59 +---
 .../datasources/jdbc/JDBCOptions.scala  | 11 ++-
 .../datasources/jdbc/JdbcRelationProvider.scala | 95 
 .../apache/spark/sql/jdbc/JDBCWriteSuite.scala  | 82 +
 9 files changed, 246 insertions(+), 73 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/50b89d05/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 4ac5fae..71bdd19 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1100,9 +1100,13 @@ CREATE TEMPORARY VIEW jdbcTable
 USING org.apache.spark.sql.jdbc
 OPTIONS (
   url "jdbc:postgresql:dbserver",
-  dbtable "schema.tablename"
+  dbtable "schema.tablename",
+  user 'username', 
+  password 'password'
 )
 
+INSERT INTO TABLE jdbcTable 
+SELECT * FROM resultTable
 {% endhighlight %}
 
 

http://git-wip-us.apache.org/repos/asf/spark/blob/50b89d05/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java
--
diff --git 
a/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java
 
b/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java
index f9087e0..1860594 100644
--- 
a/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java
+++ 
b/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java
@@ -22,6 +22,7 @@ import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.List;
 // $example off:schema_merging$
+import java.util.Properties;
 
 // $example on:basic_parquet_example$
 import org.apache.spark.api.java.JavaRDD;
@@ -235,6 +236,8 @@ public class JavaSQLDataSourceExample {
 
   private static void runJdbcDatasetExample(SparkSession spark) {
 // $example on:jdbc_dataset$
+// Note: JDBC loading and saving can be achieved via either the load/save 
or jdbc methods
+// Loading data from a JDBC source
 Dataset jdbcDF = spark.read()
   .format("jdbc")
   .option("url", "jdbc:postgresql:dbserver")
@@ -242,6 +245,24 @@ public class JavaSQLDataSourceExample {
   .option("user", "username")
   .option("password", "password")
   .load();
+
+Properties connectionProperties = new Properties();
+connectionProperties.put("user", "username");
+connectionProperties.put("password", "password");
+Dataset jdbcDF2 = spark.read()
+  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", 
connectionProperties);
+
+// Saving data to a JDBC source
+jdbcDF.write()
+  .format("jdbc")
+  .option("url", "jdbc:postgresql:dbserver")
+  .option("dbtable", "schema.tablename")
+  .option("user", "username")
+  .option("password", "password")
+  .save();
+
+jdbcDF2.write()
+  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", 
connectionProperties);
 // $example off:jdbc_dataset$
   }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/50b89d05/examples/src/main/python/sql/datasource.py
---

spark git commit: [SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection docs for ChiSqSelector

2016-09-28 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 4a8339568 -> b2a7eedcd


[SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection docs 
for ChiSqSelector

## What changes were proposed in this pull request?

A follow up for #14597 to update feature selection docs about ChiSqSelector.

## How was this patch tested?

Generated html docs. It can be previewed at:

* ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector
* mllib: 
http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector

Author: Shuai Lin 

Closes #15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b2a7eedc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b2a7eedc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b2a7eedc

Branch: refs/heads/master
Commit: b2a7eedcddf0e682ff46afd1b764d0b81ccdf395
Parents: 4a83395
Author: Shuai Lin 
Authored: Wed Sep 28 06:12:48 2016 -0400
Committer: Sean Owen 
Committed: Wed Sep 28 06:12:48 2016 -0400

--
 docs/ml-features.md  | 14 ++
 docs/mllib-feature-extraction.md | 14 ++
 2 files changed, 20 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b2a7eedc/docs/ml-features.md
--
diff --git a/docs/ml-features.md b/docs/ml-features.md
index a39b31c..a7f710f 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1331,10 +1331,16 @@ for more details on the API.
 ## ChiSqSelector
 
 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on 
labeled data with
-categorical features. ChiSqSelector orders features based on a
-[Chi-Squared test of 
independence](https://en.wikipedia.org/wiki/Chi-squared_test)
-from the class, and then filters (selects) the top features which the class 
label depends on the
-most. This is akin to yielding the features with the most predictive power.
+categorical features. ChiSqSelector uses the
+[Chi-Squared test of 
independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` 
and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This 
is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features 
instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest`, the default number of top 
features is 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection 
methods.
 
 **Examples**
 

http://git-wip-us.apache.org/repos/asf/spark/blob/b2a7eedc/docs/mllib-feature-extraction.md
--
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 353d391..87e1e02 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -225,10 +225,16 @@ features for use in model construction. It reduces the 
size of the feature space
 both speed and statistical learning behavior.
 
 
[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector)
 implements
-Chi-Squared feature selection. It operates on labeled data with categorical 
features.
-`ChiSqSelector` orders features based on a Chi-Squared test of independence 
from the class,
-and then filters (selects) the top features which the class label depends on 
the most.
-This is akin to yielding the features with the most predictive power.
+Chi-Squared feature selection. It operates on labeled data with categorical 
features. ChiSqSelector uses the
+[Chi-Squared test of 
independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` 
and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This 
is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features 
instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest`, the default number of top 
features is 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection 
methods.
 
 The number of features to select can be tuned using a held-out validation set.
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

spark git commit: [MINOR][PYSPARK][DOCS] Fix examples in PySpark documentation

2016-09-28 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master b2a7eedcd -> 219003775


[MINOR][PYSPARK][DOCS] Fix examples in PySpark documentation

## What changes were proposed in this pull request?

This PR proposes to fix wrongly indented examples in PySpark documentation

```
->>> json_sdf = spark.readStream.format("json")\
-   .schema(sdf_schema)\
-   .load(tempfile.mkdtemp())
+>>> json_sdf = spark.readStream.format("json") \\
+... .schema(sdf_schema) \\
+... .load(tempfile.mkdtemp())
```

```
-people.filter(people.age > 30).join(department, people.deptId == 
department.id)\
+people.filter(people.age > 30).join(department, people.deptId == 
department.id) \\
```

```
->>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 
4.56)])), \
-LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
+>>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 
4.56)])),
+... LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
```

```
->>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, -1.23), (2, 
4.56e-7)])), \
-LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
+>>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, -1.23), (2, 
4.56e-7)])),
+... LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
```

```
-...  for x in iterator:
-...   print(x)
+... for x in iterator:
+...  print(x)
```

## How was this patch tested?

Manually tested.

**Before**

![2016-09-26 8 36 
02](https://cloud.githubusercontent.com/assets/6477701/18834471/05c7a478-8431-11e6-94bb-09aa37b12ddb.png)

![2016-09-26 9 22 
16](https://cloud.githubusercontent.com/assets/6477701/18834472/06c8735c-8431-11e6-8775-78631eab0411.png)

https://cloud.githubusercontent.com/assets/6477701/18861294/29c0d5b4-84bf-11e6-99c5-3c9d913c125d.png";>

https://cloud.githubusercontent.com/assets/6477701/18861298/31694cd8-84bf-11e6-9e61-9888cb8c2089.png";>

https://cloud.githubusercontent.com/assets/6477701/18861301/359722da-84bf-11e6-97f9-5f5365582d14.png";>

**After**

![2016-09-26 9 29 
47](https://cloud.githubusercontent.com/assets/6477701/18834467/0367f9da-8431-11e6-86d9-a490d3297339.png)

![2016-09-26 9 30 
24](https://cloud.githubusercontent.com/assets/6477701/18834463/f870fae0-8430-11e6-9482-01fc47898492.png)

https://cloud.githubusercontent.com/assets/6477701/18861305/3ff88b88-84bf-11e6-902c-9f725e8a8b10.png";>

https://cloud.githubusercontent.com/assets/6477701/18863053/592fbc74-84ca-11e6-8dbf-99cf57947de8.png";>

https://cloud.githubusercontent.com/assets/6477701/18863060/601607be-84ca-11e6-80aa-a401df41c321.png";>

Author: hyukjinkwon 

Closes #15242 from HyukjinKwon/minor-example-pyspark.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/21900377
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/21900377
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/21900377

Branch: refs/heads/master
Commit: 2190037757a81d3172f75227f7891d968e1f0d90
Parents: b2a7eed
Author: hyukjinkwon 
Authored: Wed Sep 28 06:19:04 2016 -0400
Committer: Sean Owen 
Committed: Wed Sep 28 06:19:04 2016 -0400

--
 python/pyspark/mllib/util.py| 8 
 python/pyspark/rdd.py   | 4 ++--
 python/pyspark/sql/dataframe.py | 2 +-
 python/pyspark/sql/streaming.py | 6 +++---
 4 files changed, 10 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/21900377/python/pyspark/mllib/util.py
--
diff --git a/python/pyspark/mllib/util.py b/python/pyspark/mllib/util.py
index 48867a0..ed6fd4b 100644
--- a/python/pyspark/mllib/util.py
+++ b/python/pyspark/mllib/util.py
@@ -140,8 +140,8 @@ class MLUtils(object):
 >>> from pyspark.mllib.regression import LabeledPoint
 >>> from glob import glob
 >>> from pyspark.mllib.util import MLUtils
->>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 
4.56)])), \
-LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
+>>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 
4.56)])),
+... LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
 >>> tempFile = NamedTemporaryFile(delete=True)
 >>> tempFile.close()
 >>> MLUtils.saveAsLibSVMFile(sc.parallelize(examples), tempFile.name)
@@ -166,8 +166,8 @@ class MLUtils(object):
 >>> from tempfile import NamedTemporaryFile
 >>> from pyspark.mllib.util import MLUtils
 >>> from pyspark.mllib.regression import LabeledPoint

spark git commit: [MINOR][PYSPARK][DOCS] Fix examples in PySpark documentation

2016-09-28 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 1b02f8820 -> 4d73d5cd8


[MINOR][PYSPARK][DOCS] Fix examples in PySpark documentation

## What changes were proposed in this pull request?

This PR proposes to fix wrongly indented examples in PySpark documentation

```
->>> json_sdf = spark.readStream.format("json")\
-   .schema(sdf_schema)\
-   .load(tempfile.mkdtemp())
+>>> json_sdf = spark.readStream.format("json") \\
+... .schema(sdf_schema) \\
+... .load(tempfile.mkdtemp())
```

```
-people.filter(people.age > 30).join(department, people.deptId == 
department.id)\
+people.filter(people.age > 30).join(department, people.deptId == 
department.id) \\
```

```
->>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 
4.56)])), \
-LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
+>>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 
4.56)])),
+... LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
```

```
->>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, -1.23), (2, 
4.56e-7)])), \
-LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
+>>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, -1.23), (2, 
4.56e-7)])),
+... LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
```

```
-...  for x in iterator:
-...   print(x)
+... for x in iterator:
+...  print(x)
```

## How was this patch tested?

Manually tested.

**Before**

![2016-09-26 8 36 
02](https://cloud.githubusercontent.com/assets/6477701/18834471/05c7a478-8431-11e6-94bb-09aa37b12ddb.png)

![2016-09-26 9 22 
16](https://cloud.githubusercontent.com/assets/6477701/18834472/06c8735c-8431-11e6-8775-78631eab0411.png)

https://cloud.githubusercontent.com/assets/6477701/18861294/29c0d5b4-84bf-11e6-99c5-3c9d913c125d.png";>

https://cloud.githubusercontent.com/assets/6477701/18861298/31694cd8-84bf-11e6-9e61-9888cb8c2089.png";>

https://cloud.githubusercontent.com/assets/6477701/18861301/359722da-84bf-11e6-97f9-5f5365582d14.png";>

**After**

![2016-09-26 9 29 
47](https://cloud.githubusercontent.com/assets/6477701/18834467/0367f9da-8431-11e6-86d9-a490d3297339.png)

![2016-09-26 9 30 
24](https://cloud.githubusercontent.com/assets/6477701/18834463/f870fae0-8430-11e6-9482-01fc47898492.png)

https://cloud.githubusercontent.com/assets/6477701/18861305/3ff88b88-84bf-11e6-902c-9f725e8a8b10.png";>

https://cloud.githubusercontent.com/assets/6477701/18863053/592fbc74-84ca-11e6-8dbf-99cf57947de8.png";>

https://cloud.githubusercontent.com/assets/6477701/18863060/601607be-84ca-11e6-80aa-a401df41c321.png";>

Author: hyukjinkwon 

Closes #15242 from HyukjinKwon/minor-example-pyspark.

(cherry picked from commit 2190037757a81d3172f75227f7891d968e1f0d90)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4d73d5cd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4d73d5cd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4d73d5cd

Branch: refs/heads/branch-2.0
Commit: 4d73d5cd82ebc980f996c78f9afb8a97418ab7ab
Parents: 1b02f88
Author: hyukjinkwon 
Authored: Wed Sep 28 06:19:04 2016 -0400
Committer: Sean Owen 
Committed: Wed Sep 28 06:19:18 2016 -0400

--
 python/pyspark/mllib/util.py| 8 
 python/pyspark/rdd.py   | 4 ++--
 python/pyspark/sql/dataframe.py | 2 +-
 python/pyspark/sql/streaming.py | 6 +++---
 4 files changed, 10 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4d73d5cd/python/pyspark/mllib/util.py
--
diff --git a/python/pyspark/mllib/util.py b/python/pyspark/mllib/util.py
index 48867a0..ed6fd4b 100644
--- a/python/pyspark/mllib/util.py
+++ b/python/pyspark/mllib/util.py
@@ -140,8 +140,8 @@ class MLUtils(object):
 >>> from pyspark.mllib.regression import LabeledPoint
 >>> from glob import glob
 >>> from pyspark.mllib.util import MLUtils
->>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 
4.56)])), \
-LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
+>>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 
4.56)])),
+... LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
 >>> tempFile = NamedTemporaryFile(delete=True)
 >>> tempFile.close()
 >>> MLUtils.saveAsLibSVMFile(sc.parallelize(examples), tempFile.name)
@@ -166,8 +166,8 @@ class MLUtils(object):
 >>> from tempfile import NamedTemporaryFile

spark git commit: [SPARK-17614][SQL] sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2016-09-29 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master f7082ac12 -> b35b0dbbf


[SPARK-17614][SQL] sparkSession.read() .jdbc(***) use the sql syntax "where 
1=0" that Cassandra does not support

## What changes were proposed in this pull request?

Use dialect's table-exists query rather than hard-coded WHERE 1=0 query

## How was this patch tested?

Existing tests.

Author: Sean Owen 

Closes #15196 from srowen/SPARK-17614.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b35b0dbb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b35b0dbb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b35b0dbb

Branch: refs/heads/master
Commit: b35b0dbbfa3dc1bdf5e2fa1e9677d06635142b22
Parents: f7082ac
Author: Sean Owen 
Authored: Thu Sep 29 08:24:34 2016 -0400
Committer: Sean Owen 
Committed: Thu Sep 29 08:24:34 2016 -0400

--
 .../sql/execution/datasources/jdbc/JDBCRDD.scala |  6 ++
 .../org/apache/spark/sql/jdbc/JdbcDialects.scala | 15 ++-
 2 files changed, 16 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b35b0dbb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
index a7da29f..f10615e 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
@@ -58,11 +58,11 @@ object JDBCRDD extends Logging {
 val dialect = JdbcDialects.get(url)
 val conn: Connection = JdbcUtils.createConnectionFactory(url, properties)()
 try {
-  val statement = conn.prepareStatement(s"SELECT * FROM $table WHERE 1=0")
+  val statement = conn.prepareStatement(dialect.getSchemaQuery(table))
   try {
 val rs = statement.executeQuery()
 try {
-  return JdbcUtils.getSchema(rs, dialect)
+  JdbcUtils.getSchema(rs, dialect)
 } finally {
   rs.close()
 }
@@ -72,8 +72,6 @@ object JDBCRDD extends Logging {
 } finally {
   conn.close()
 }
-
-throw new RuntimeException("This line is unreachable.")
   }
 
   /**

http://git-wip-us.apache.org/repos/asf/spark/blob/b35b0dbb/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
index 3a6d5b7..8dd4b8f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
@@ -19,7 +19,7 @@ package org.apache.spark.sql.jdbc
 
 import java.sql.Connection
 
-import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.annotation.{DeveloperApi, Since}
 import org.apache.spark.sql.types._
 
 /**
@@ -100,6 +100,19 @@ abstract class JdbcDialect extends Serializable {
   }
 
   /**
+   * The SQL query that should be used to discover the schema of a table. It 
only needs to
+   * ensure that the result set has the same schema as the table, such as by 
calling
+   * "SELECT * ...". Dialects can override this method to return a query that 
works best in a
+   * particular database.
+   * @param table The name of the table.
+   * @return The SQL query to use for discovering the schema.
+   */
+  @Since("2.1.0")
+  def getSchemaQuery(table: String): String = {
+s"SELECT * FROM $table WHERE 1=0"
+  }
+
+  /**
* Override connection specific properties to run before a select is made.  
This is in place to
* allow dialects that need special treatment to optimize behavior.
* @param connection The connection object


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [MINOR][DOCS] Fix the doc. of spark-streaming with kinesis

2016-09-29 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master b35b0dbbf -> b2e9731ca


[MINOR][DOCS] Fix the doc. of spark-streaming with kinesis

## What changes were proposed in this pull request?
This PR just fixes the documentation of `spark-kinesis-integration`.
Since `SPARK-17418` prevented all the Kinesis artifacts (including the Kinesis 
example code)
from being published, `bin/run-example streaming.KinesisWordCountASL` and 
`bin/run-example streaming.JavaKinesisWordCountASL` do not work.
Instead, the examples now fetch the Kinesis jar from Spark Packages.

Author: Takeshi YAMAMURO 

Closes #15260 from maropu/DocFixKinesis.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b2e9731c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b2e9731c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b2e9731c

Branch: refs/heads/master
Commit: b2e9731ca494c0c60d571499f68bb8306a3c9fe5
Parents: b35b0db
Author: Takeshi YAMAMURO 
Authored: Thu Sep 29 08:26:03 2016 -0400
Committer: Sean Owen 
Committed: Thu Sep 29 08:26:03 2016 -0400

--
 docs/streaming-kinesis-integration.md | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b2e9731c/docs/streaming-kinesis-integration.md
--
diff --git a/docs/streaming-kinesis-integration.md 
b/docs/streaming-kinesis-integration.md
index 96198dd..6be0b54 100644
--- a/docs/streaming-kinesis-integration.md
+++ b/docs/streaming-kinesis-integration.md
@@ -166,10 +166,7 @@ A Kinesis stream can be set up at one of the valid Kinesis 
endpoints with 1 or m
  Running the Example
 To run the example,
 
-- Download Spark source and follow the [instructions](building-spark.html) to 
build Spark with profile *-Pkinesis-asl*.
-
-mvn -Pkinesis-asl -DskipTests clean package
-
+- Download a Spark binary from the [download 
site](http://spark.apache.org/downloads.html).
 
 - Set up Kinesis stream (see earlier section) within AWS. Note the name of the 
Kinesis stream and the endpoint URL corresponding to the region where the 
stream was created.
 
@@ -180,12 +177,12 @@ To run the example,


 
-bin/run-example streaming.KinesisWordCountASL [Kinesis app name] 
[Kinesis stream name] [endpoint URL]
+bin/run-example --packages 
org.apache.spark:spark-streaming-kinesis-asl_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
 streaming.KinesisWordCountASL [Kinesis app name] [Kinesis stream name] 
[endpoint URL]
 


 
-bin/run-example streaming.JavaKinesisWordCountASL [Kinesis app name] 
[Kinesis stream name] [endpoint URL]
+bin/run-example --packages 
org.apache.spark:spark-streaming-kinesis-asl_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
 streaming.JavaKinesisWordCountASL [Kinesis app name] [Kinesis stream name] 
[endpoint URL]
 




-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [MINOR][DOCS] Fix the doc. of spark-streaming with kinesis

2016-09-29 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 7d612a7d5 -> ca8130050


[MINOR][DOCS] Fix the doc. of spark-streaming with kinesis

## What changes were proposed in this pull request?
This PR just fixes the documentation of `spark-kinesis-integration`.
Since `SPARK-17418` prevented all the Kinesis artifacts (including the Kinesis 
example code)
from being published, `bin/run-example streaming.KinesisWordCountASL` and 
`bin/run-example streaming.JavaKinesisWordCountASL` do not work.
Instead, the examples now fetch the Kinesis jar from Spark Packages.

Author: Takeshi YAMAMURO 

Closes #15260 from maropu/DocFixKinesis.

(cherry picked from commit b2e9731ca494c0c60d571499f68bb8306a3c9fe5)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ca813005
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ca813005
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ca813005

Branch: refs/heads/branch-2.0
Commit: ca8130050964fac8baa568918f0b67c44a7a2518
Parents: 7d612a7
Author: Takeshi YAMAMURO 
Authored: Thu Sep 29 08:26:03 2016 -0400
Committer: Sean Owen 
Committed: Thu Sep 29 08:26:14 2016 -0400

--
 docs/streaming-kinesis-integration.md | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ca813005/docs/streaming-kinesis-integration.md
--
diff --git a/docs/streaming-kinesis-integration.md 
b/docs/streaming-kinesis-integration.md
index 96198dd..6be0b54 100644
--- a/docs/streaming-kinesis-integration.md
+++ b/docs/streaming-kinesis-integration.md
@@ -166,10 +166,7 @@ A Kinesis stream can be set up at one of the valid Kinesis 
endpoints with 1 or m
  Running the Example
 To run the example,
 
-- Download Spark source and follow the [instructions](building-spark.html) to 
build Spark with profile *-Pkinesis-asl*.
-
-mvn -Pkinesis-asl -DskipTests clean package
-
+- Download a Spark binary from the [download 
site](http://spark.apache.org/downloads.html).
 
 - Set up Kinesis stream (see earlier section) within AWS. Note the name of the 
Kinesis stream and the endpoint URL corresponding to the region where the 
stream was created.
 
@@ -180,12 +177,12 @@ To run the example,


 
-bin/run-example streaming.KinesisWordCountASL [Kinesis app name] 
[Kinesis stream name] [endpoint URL]
+bin/run-example --packages 
org.apache.spark:spark-streaming-kinesis-asl_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
 streaming.KinesisWordCountASL [Kinesis app name] [Kinesis stream name] 
[endpoint URL]
 


 
-bin/run-example streaming.JavaKinesisWordCountASL [Kinesis app name] 
[Kinesis stream name] [endpoint URL]
+bin/run-example --packages 
org.apache.spark:spark-streaming-kinesis-asl_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
 streaming.JavaKinesisWordCountASL [Kinesis app name] [Kinesis stream name] 
[endpoint URL]
 




-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17704][ML][MLLIB] ChiSqSelector performance improvement.

2016-10-01 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master af6ece33d -> b88cb63da


[SPARK-17704][ML][MLLIB] ChiSqSelector performance improvement.

## What changes were proposed in this pull request?

Partial revert of #15277: instead of requiring sorted input, sort and store the 
selected indices in the model.

## How was this patch tested?

Existing tests.

Author: Sean Owen 

Closes #15299 from srowen/SPARK-17704.2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b88cb63d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b88cb63d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b88cb63d

Branch: refs/heads/master
Commit: b88cb63da39786c07cb4bfa70afed32ec5eb3286
Parents: af6ece3
Author: Sean Owen 
Authored: Sat Oct 1 16:10:39 2016 -0400
Committer: Sean Owen 
Committed: Sat Oct 1 16:10:39 2016 -0400

--
 .../apache/spark/ml/feature/ChiSqSelector.scala |  2 +-
 .../spark/mllib/feature/ChiSqSelector.scala | 22 ++--
 python/pyspark/ml/feature.py|  2 +-
 3 files changed, 13 insertions(+), 13 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b88cb63d/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
index 9c131a4..d0385e2 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
@@ -193,7 +193,7 @@ final class ChiSqSelectorModel private[ml] (
 
   import ChiSqSelectorModel._
 
-  /** list of indices to select (filter). Must be ordered asc */
+  /** list of indices to select (filter). */
   @Since("1.6.0")
   val selectedFeatures: Array[Int] = chiSqSelector.selectedFeatures
 

http://git-wip-us.apache.org/repos/asf/spark/blob/b88cb63d/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
index 706ce78..c305b36 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
@@ -35,14 +35,15 @@ import org.apache.spark.sql.{Row, SparkSession}
 /**
  * Chi Squared selector model.
  *
- * @param selectedFeatures list of indices to select (filter). Must be ordered 
asc
+ * @param selectedFeatures list of indices to select (filter).
  */
 @Since("1.3.0")
 class ChiSqSelectorModel @Since("1.3.0") (
   @Since("1.3.0") val selectedFeatures: Array[Int]) extends VectorTransformer 
with Saveable {
 
-  require(isSorted(selectedFeatures), "Array has to be sorted asc")
+  private val filterIndices = selectedFeatures.sorted
 
+  @deprecated("not intended for subclasses to use", "2.1.0")
   protected def isSorted(array: Array[Int]): Boolean = {
 var i = 1
 val len = array.length
@@ -61,7 +62,7 @@ class ChiSqSelectorModel @Since("1.3.0") (
*/
   @Since("1.3.0")
   override def transform(vector: Vector): Vector = {
-compress(vector, selectedFeatures)
+compress(vector)
   }
 
   /**
@@ -69,9 +70,8 @@ class ChiSqSelectorModel @Since("1.3.0") (
* Preserves the order of filtered features the same as their indices are 
stored.
* Might be moved to Vector as .slice
* @param features vector
-   * @param filterIndices indices of features to filter, must be ordered asc
*/
-  private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
+  private def compress(features: Vector): Vector = {
 features match {
   case SparseVector(size, indices, values) =>
 val newSize = filterIndices.length
@@ -230,23 +230,23 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val chiSqTestResult = Statistics.chiSqTest(data)
+val chiSqTestResult = Statistics.chiSqTest(data).zipWithIndex
 val features = selectorType match {
   case ChiSqSelector.KBest =>
-chiSqTestResult.zipWithIndex
+chiSqTestResult
   .sortBy { case (res, _) => -res.statistic }
   .take(numTopFeatures)
   case ChiSqSelector.Percentile =>
-chiSqTestResult.zipWithIndex
+chiSqTestResult
   .sortBy { case (res, _) => -res.statistic }
   .take((chiSqTestResult.length *

spark git commit: [SPARK-17598][SQL][WEB UI] User-friendly name for Spark Thrift Server in web UI

2016-10-03 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 76dc2d907 -> de3f71ed7


[SPARK-17598][SQL][WEB UI] User-friendly name for Spark Thrift Server in web UI

## What changes were proposed in this pull request?

The name of the Spark Thrift JDBC/ODBC Server in the web UI reflects the name of the 
class, i.e. org.apache.spark.sql.hive.thrift.HiveThriftServer2. I changed it to 
Thrift JDBC/ODBC Server (like Spark shell for spark-shell), as recommended by 
jaceklaskowski. Note that the user can still change the name by adding the `--name "App 
Name"` parameter to the start script, as before.

## How was this patch tested?

By running the script with various parameters and checking the web ui

![screen shot 2016-09-27 at 12 19 12 
pm](https://cloud.githubusercontent.com/assets/13952758/1329/aebca47c-84ac-11e6-93d0-6e98684977c5.png)

Author: Alex Bozarth 

Closes #15268 from ajbozarth/spark17598.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/de3f71ed
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/de3f71ed
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/de3f71ed

Branch: refs/heads/master
Commit: de3f71ed7a301387e870a38c14dad9508efc9743
Parents: 76dc2d9
Author: Alex Bozarth 
Authored: Mon Oct 3 10:24:30 2016 +0100
Committer: Sean Owen 
Committed: Mon Oct 3 10:24:30 2016 +0100

--
 sbin/start-thriftserver.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/de3f71ed/sbin/start-thriftserver.sh
--
diff --git a/sbin/start-thriftserver.sh b/sbin/start-thriftserver.sh
index ad7e7c5..f02f317 100755
--- a/sbin/start-thriftserver.sh
+++ b/sbin/start-thriftserver.sh
@@ -53,4 +53,4 @@ fi
 
 export SUBMIT_USAGE_FUNCTION=usage
 
-exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 "$@"
+exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Thrift 
JDBC/ODBC Server" "$@"


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17736][DOCUMENTATION][SPARKR] Update R README for rmarkdown,…

2016-10-03 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master de3f71ed7 -> a27033c0b


[SPARK-17736][DOCUMENTATION][SPARKR] Update R README for rmarkdown,…

## What changes were proposed in this pull request?

To build R docs (which are built when R tests are run), users need to install 
pandoc and rmarkdown. This was done for Jenkins in 
~~[SPARK-17420](https://issues.apache.org/jira/browse/SPARK-17420)~~

… pandoc]

Author: Jagadeesan 

Closes #15309 from jagadeesanas2/SPARK-17736.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a27033c0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a27033c0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a27033c0

Branch: refs/heads/master
Commit: a27033c0bbaae8f31db9b91693947ed71738ed11
Parents: de3f71e
Author: Jagadeesan 
Authored: Mon Oct 3 10:46:38 2016 +0100
Committer: Sean Owen 
Committed: Mon Oct 3 10:46:38 2016 +0100

--
 docs/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a27033c0/docs/README.md
--
diff --git a/docs/README.md b/docs/README.md
index 8b515e1..ffd3b57 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -19,8 +19,8 @@ installed. Also install the following libraries:
 $ sudo gem install jekyll jekyll-redirect-from pygments.rb
 $ sudo pip install Pygments
 # Following is needed only for generating API docs
-$ sudo pip install sphinx
-$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "roxygen2", 
"testthat"), repos="http://cran.stat.ucla.edu/";)'
+$ sudo pip install sphinx pypandoc
+$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "roxygen2", 
"testthat", "rmarkdown"), repos="http://cran.stat.ucla.edu/";)'
 ```
 (Note: If you are on a system with both Ruby 1.9 and Ruby 2.0 you may need to 
replace gem with gem2.0)
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17736][DOCUMENTATION][SPARKR] Update R README for rmarkdown,…

2016-10-03 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 744aac8e6 -> b57e2acb1


[SPARK-17736][DOCUMENTATION][SPARKR] Update R README for rmarkdown,…

## What changes were proposed in this pull request?

To build R docs (which are built when R tests are run), users need to install 
pandoc and rmarkdown. This was done for Jenkins in 
~~[SPARK-17420](https://issues.apache.org/jira/browse/SPARK-17420)~~

… pandoc]

Author: Jagadeesan 

Closes #15309 from jagadeesanas2/SPARK-17736.

(cherry picked from commit a27033c0bbaae8f31db9b91693947ed71738ed11)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b57e2acb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b57e2acb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b57e2acb

Branch: refs/heads/branch-2.0
Commit: b57e2acb134d94dafc81686da875c5dd3ea35c74
Parents: 744aac8
Author: Jagadeesan 
Authored: Mon Oct 3 10:46:38 2016 +0100
Committer: Sean Owen 
Committed: Mon Oct 3 10:49:24 2016 +0100

--
 docs/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b57e2acb/docs/README.md
--
diff --git a/docs/README.md b/docs/README.md
index 8b515e1..ffd3b57 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -19,8 +19,8 @@ installed. Also install the following libraries:
 $ sudo gem install jekyll jekyll-redirect-from pygments.rb
 $ sudo pip install Pygments
 # Following is needed only for generating API docs
-$ sudo pip install sphinx
-$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "roxygen2", 
"testthat"), repos="http://cran.stat.ucla.edu/";)'
+$ sudo pip install sphinx pypandoc
+$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "roxygen2", 
"testthat", "rmarkdown"), repos="http://cran.stat.ucla.edu/";)'
 ```
 (Note: If you are on a system with both Ruby 1.9 and Ruby 2.0 you may need to 
replace gem with gem2.0)
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17671][WEBUI] Spark 2.0 history server summary page is slow even set spark.history.ui.maxApplications

2016-10-04 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 126baa8d3 -> 8e8de0073


[SPARK-17671][WEBUI] Spark 2.0 history server summary page is slow even set 
spark.history.ui.maxApplications

## What changes were proposed in this pull request?

Return Iterator of applications internally in history server, for consistency 
and performance. See https://github.com/apache/spark/pull/15248 for some 
back-story.

The code that calls, and is called by, HistoryServer.getApplicationList wants an 
Iterator, but this method materializes an Iterable, which potentially causes a 
performance problem. It is also simpler to make this internal method pass 
through an Iterator.

## How was this patch tested?

Existing tests.

Author: Sean Owen 

Closes #15321 from srowen/SPARK-17671.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8e8de007
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8e8de007
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8e8de007

Branch: refs/heads/master
Commit: 8e8de0073d71bb00baeb24c612d7841b6274f652
Parents: 126baa8
Author: Sean Owen 
Authored: Tue Oct 4 10:29:22 2016 +0100
Committer: Sean Owen 
Committed: Tue Oct 4 10:29:22 2016 +0100

--
 .../history/ApplicationHistoryProvider.scala|  2 +-
 .../deploy/history/FsHistoryProvider.scala  |  2 +-
 .../spark/deploy/history/HistoryPage.scala  |  5 +--
 .../spark/deploy/history/HistoryServer.scala|  4 +--
 .../status/api/v1/ApplicationListResource.scala | 38 +++-
 .../deploy/history/HistoryServerSuite.scala |  4 +--
 project/MimaExcludes.scala  |  2 ++
 7 files changed, 22 insertions(+), 35 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8e8de007/core/src/main/scala/org/apache/spark/deploy/history/ApplicationHistoryProvider.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/deploy/history/ApplicationHistoryProvider.scala
 
b/core/src/main/scala/org/apache/spark/deploy/history/ApplicationHistoryProvider.scala
index ba42b48..ad7a097 100644
--- 
a/core/src/main/scala/org/apache/spark/deploy/history/ApplicationHistoryProvider.scala
+++ 
b/core/src/main/scala/org/apache/spark/deploy/history/ApplicationHistoryProvider.scala
@@ -77,7 +77,7 @@ private[history] abstract class ApplicationHistoryProvider {
*
* @return List of all know applications.
*/
-  def getListing(): Iterable[ApplicationHistoryInfo]
+  def getListing(): Iterator[ApplicationHistoryInfo]
 
   /**
* Returns the Spark UI for a specific application.

http://git-wip-us.apache.org/repos/asf/spark/blob/8e8de007/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala 
b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
index c5740e4..3c2d169 100644
--- 
a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
+++ 
b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
@@ -222,7 +222,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, 
clock: Clock)
 }
   }
 
-  override def getListing(): Iterable[FsApplicationHistoryInfo] = 
applications.values
+  override def getListing(): Iterator[FsApplicationHistoryInfo] = 
applications.values.iterator
 
   override def getApplicationInfo(appId: String): 
Option[FsApplicationHistoryInfo] = {
 applications.get(appId)

http://git-wip-us.apache.org/repos/asf/spark/blob/8e8de007/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala 
b/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala
index b4f5a61..95b7222 100644
--- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala
@@ -29,10 +29,7 @@ private[history] class HistoryPage(parent: HistoryServer) 
extends WebUIPage("")
 val requestedIncomplete =
   
Option(request.getParameter("showIncomplete")).getOrElse("false").toBoolean
 
-val allApps = parent.getApplicationList()
-  .filter(_.completed != requestedIncomplete)
-val allAppsSize = allApps.size
-
+val allAppsSize = parent.getApplicationList().count(_.completed != 
requestedIncomplete)
 val providerConfig = parent.getProviderConfig()
 val content =
   

http://git-wip-us.apache.org/repos/asf/spark/blob/8e8de007/core/src/main/scala/org/apache/sp
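
For illustration, a minimal sketch (plain Scala, not Spark code) of the idea behind this change: counting over a lazy `Iterator` avoids materializing an intermediate collection, unlike filtering an `Iterable` first. The record type and listing size are assumptions.

```scala
object IteratorCountSketch {
  final case class AppInfo(id: String, completed: Boolean)

  def main(args: Array[String]): Unit = {
    // Stand-in for a provider's listing; nothing is materialized until consumed.
    def getListing(): Iterator[AppInfo] =
      (1 to 1000000).iterator.map(i => AppInfo(s"app-$i", completed = i % 2 == 0))

    // Count matching applications lazily, without building an intermediate Seq.
    val completedCount = getListing().count(_.completed)
    println(completedCount)
  }
}
```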

spark git commit: [SPARK-16962][CORE][SQL] Fix misaligned record accesses for SPARC architectures

2016-10-04 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 8e8de0073 -> 7d5160883


[SPARK-16962][CORE][SQL] Fix misaligned record accesses for SPARC architectures

## What changes were proposed in this pull request?

Made changes to record length offsets to make them uniform throughout various 
areas of Spark core and unsafe

## How was this patch tested?

This change affects only SPARC architectures and was tested on X86 
architectures as well for regression.

Author: sumansomasundar 

Closes #14762 from sumansomasundar/master.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7d516088
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7d516088
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7d516088

Branch: refs/heads/master
Commit: 7d5160883542f3d9dcb3babda92880985398e9af
Parents: 8e8de00
Author: sumansomasundar 
Authored: Tue Oct 4 10:31:56 2016 +0100
Committer: Sean Owen 
Committed: Tue Oct 4 10:31:56 2016 +0100

--
 .../spark/unsafe/UnsafeAlignedOffset.java   | 58 
 .../spark/unsafe/array/ByteArrayMethods.java| 31 ---
 .../spark/unsafe/map/BytesToBytesMap.java   | 57 ++-
 .../unsafe/sort/UnsafeExternalSorter.java   | 19 ---
 .../unsafe/sort/UnsafeInMemorySorter.java   | 14 +++--
 .../compression/CompressibleColumnBuilder.scala | 11 +++-
 6 files changed, 144 insertions(+), 46 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7d516088/common/unsafe/src/main/java/org/apache/spark/unsafe/UnsafeAlignedOffset.java
--
diff --git 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/UnsafeAlignedOffset.java 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/UnsafeAlignedOffset.java
new file mode 100644
index 000..be62e40
--- /dev/null
+++ 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/UnsafeAlignedOffset.java
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe;
+
+/**
+ * Class to make record length offsets uniform throughout
+ * various areas of Apache Spark core and unsafe.  The SPARC platform
+ * requires this because using a 4-byte int for record lengths causes
+ * the entire record of 8-byte items to become misaligned by 4 bytes.
+ * Using an 8-byte long for record lengths keeps things 8-byte aligned.
+ */
+public class UnsafeAlignedOffset {
+
+  private static final int UAO_SIZE = Platform.unaligned() ? 4 : 8;
+
+  public static int getUaoSize() {
+return UAO_SIZE;
+  }
+
+  public static int getSize(Object object, long offset) {
+switch (UAO_SIZE) {
+  case 4:
+return Platform.getInt(object, offset);
+  case 8:
+return (int)Platform.getLong(object, offset);
+  default:
+throw new AssertionError("Illegal UAO_SIZE");
+}
+  }
+
+  public static void putSize(Object object, long offset, int value) {
+switch (UAO_SIZE) {
+  case 4:
+Platform.putInt(object, offset, value);
+break;
+  case 8:
+Platform.putLong(object, offset, value);
+break;
+  default:
+throw new AssertionError("Illegal UAO_SIZE");
+}
+  }
+}

http://git-wip-us.apache.org/repos/asf/spark/blob/7d516088/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java
--
diff --git 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java
 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java
index cf42877..9c551ab 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java
@@ -40,6 +40,7 @@ public class ByteArrayMethods {
 }
   }
 
+  private static final boolean unaligned = Platform.unaligned();
   /**
* Optimized byte array equality check for byte arrays.
* @return true if the

spark git commit: [SPARK-25036][SQL] Avoid discarding unmoored doc comment in Scala-2.12.

2018-08-10 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 4f1758509 -> 132bcceeb


[SPARK-25036][SQL] Avoid discarding unmoored doc comment in Scala-2.12.

## What changes were proposed in this pull request?

This PR avoid the following compilation error using sbt in Scala-2.12.

```
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:410:
 discarding unmoored doc comment
[error] [warn] /**
[error] [warn]
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:441:
 discarding unmoored doc comment
[error] [warn] /**
[error] [warn]
...
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:440:
 discarding unmoored doc comment
[error] [warn] /**
[error] [warn]
```

## How was this patch tested?

Existing UTs

Closes #22059 from kiszk/SPARK-25036d.

Authored-by: Kazuaki Ishizaki 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/132bccee
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/132bccee
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/132bccee

Branch: refs/heads/master
Commit: 132bcceebb7723aea9845c9e207e572ecb44a4a2
Parents: 4f17585
Author: Kazuaki Ishizaki 
Authored: Fri Aug 10 07:32:52 2018 -0500
Committer: Sean Owen 
Committed: Fri Aug 10 07:32:52 2018 -0500

--
 .../main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala  | 4 ++--
 .../src/main/scala/org/apache/spark/deploy/yarn/Client.scala | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/132bccee/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala 
b/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
index 918560a..4cdd172 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
@@ -407,7 +407,7 @@ private[spark] object RandomForest extends Logging with 
Serializable {
   metadata.isMulticlassWithCategoricalFeatures)
 logDebug("using nodeIdCache = " + nodeIdCache.nonEmpty.toString)
 
-/**
+/*
  * Performs a sequential aggregation over a partition for a particular 
tree and node.
  *
  * For each feature, the aggregate sufficient statistics are updated for 
the relevant
@@ -438,7 +438,7 @@ private[spark] object RandomForest extends Logging with 
Serializable {
   }
 }
 
-/**
+/*
  * Performs a sequential aggregation over a partition.
  *
  * Each data point contributes to one node. For each feature,

http://git-wip-us.apache.org/repos/asf/spark/blob/132bccee/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
--
diff --git 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
index ed9879c..75614a4 100644
--- 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
+++ 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -437,7 +437,7 @@ private[spark] class Client(
   }
 }
 
-/**
+/*
  * Distribute a file to the cluster.
  *
  * If the file's path is a "local:" URI, it's actually not distributed. 
Other files are copied
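
For illustration, a minimal sketch of the warning being silenced: a Scaladoc-style comment inside a method body is "unmoored" under Scala 2.12, so a plain block comment is used instead. The method and data below are hypothetical.

```scala
object UnmooredDocCommentSketch {
  def distribute(paths: Seq[String]): Int = {
    /*
     * A Scaladoc-style comment (slash plus double asterisk) placed here, inside a
     * method body, is "unmoored" because it does not precede a definition; Scala
     * 2.12 warns and discards it, so a plain block comment is used instead.
     */
    paths.count(_.startsWith("local:"))
  }

  def main(args: Array[String]): Unit =
    println(distribute(Seq("local:/tmp/a.jar", "hdfs:///b.jar")))
}
```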


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-25036][SQL][FOLLOW-UP] Avoid match may not be exhaustive in Scala-2.12.

2018-08-10 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 132bcceeb -> 1dd0f1744


[SPARK-25036][SQL][FOLLOW-UP] Avoid match may not be exhaustive in Scala-2.12.

## What changes were proposed in this pull request?

This is a follow-up pr of #22014 and #22039

We still have some more compilation errors in mllib with scala-2.12 with sbt:

```
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala:116:
 match may not be exhaustive.
[error] It would fail on the following inputs: ("silhouette", _), (_, 
"cosine"), (_, "squaredEuclidean"), (_, String()), (_, _)
[error] [warn] ($(metricName), $(distanceMeasure)) match {
[error] [warn]
```

## How was this patch tested?

Existing UTs

Closes #22058 from kiszk/SPARK-25036c.

Authored-by: Kazuaki Ishizaki 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1dd0f174
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1dd0f174
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1dd0f174

Branch: refs/heads/master
Commit: 1dd0f1744651efadaa349b96cfd3aaafda1e9f57
Parents: 132bcce
Author: Kazuaki Ishizaki 
Authored: Fri Aug 10 07:34:09 2018 -0500
Committer: Sean Owen 
Committed: Fri Aug 10 07:34:09 2018 -0500

--
 .../scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala | 2 ++
 1 file changed, 2 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1dd0f174/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
index a6d6b4e..5c1d1ae 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
@@ -119,6 +119,8 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") 
override val uid: Str
   df, $(predictionCol), $(featuresCol))
   case ("silhouette", "cosine") =>
 CosineSilhouette.computeSilhouetteScore(df, $(predictionCol), 
$(featuresCol))
+  case (mn, dm) =>
+throw new IllegalArgumentException(s"No support for metric $mn, 
distance $dm")
 }
   }
 }
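
For illustration, a minimal sketch of the shape of this fix: a catch-all case keeps a tuple match exhaustive under Scala 2.12's stricter checking. The `describe` helper is hypothetical; only the metric and distance names mirror the evaluator.

```scala
object ExhaustiveMatchSketch {
  // Hypothetical helper mirroring the evaluator's (metricName, distanceMeasure) dispatch.
  def describe(metric: String, distance: String): String =
    (metric, distance) match {
      case ("silhouette", "squaredEuclidean") => "silhouette with squared Euclidean distance"
      case ("silhouette", "cosine")           => "silhouette with cosine distance"
      // Catch-all case: without it, Scala 2.12 warns "match may not be exhaustive".
      case (m, d) =>
        throw new IllegalArgumentException(s"No support for metric $m, distance $d")
    }

  def main(args: Array[String]): Unit =
    println(describe("silhouette", "cosine"))
}
```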


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [MINOR][BUILD] Add ECCN notice required by http://www.apache.org/dev/crypto.html

2018-08-10 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 1dd0f1744 -> 91cdab51c


[MINOR][BUILD] Add ECCN notice required by http://www.apache.org/dev/crypto.html

## What changes were proposed in this pull request?

Add ECCN notice required by http://www.apache.org/dev/crypto.html
See https://issues.apache.org/jira/browse/LEGAL-398

This should probably be backported to 2.3, 2.2, as that's when the key dep 
(commons crypto) turned up. BC is actually unused, but still there.

## How was this patch tested?

N/A

Closes #22064 from srowen/ECCN.

Authored-by: Sean Owen 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/91cdab51
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/91cdab51
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/91cdab51

Branch: refs/heads/master
Commit: 91cdab51ccb3a4e3b6d76132d00f3da30598735b
Parents: 1dd0f17
Author: Sean Owen 
Authored: Fri Aug 10 11:15:36 2018 -0500
Committer: Sean Owen 
Committed: Fri Aug 10 11:15:36 2018 -0500

--
 NOTICE| 24 
 NOTICE-binary | 25 +
 2 files changed, 49 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/91cdab51/NOTICE
--
diff --git a/NOTICE b/NOTICE
index 9246cc5..23cb53f 100644
--- a/NOTICE
+++ b/NOTICE
@@ -4,3 +4,27 @@ Copyright 2014 and onwards The Apache Software Foundation.
 This product includes software developed at
 The Apache Software Foundation (http://www.apache.org/).
 
+
+Export Control Notice
+-
+
+This distribution includes cryptographic software. The country in which you 
currently reside may have
+restrictions on the import, possession, use, and/or re-export to another 
country, of encryption software.
+BEFORE using any encryption software, please check your country's laws, 
regulations and policies concerning
+the import, possession, or use, and re-export of encryption software, to see 
if this is permitted. See
+<http://www.wassenaar.org/> for more information.
+
+The U.S. Government Department of Commerce, Bureau of Industry and Security 
(BIS), has classified this
+software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes 
information security software
+using or performing cryptographic functions with asymmetric algorithms. The 
form and manner of this Apache
+Software Foundation distribution makes it eligible for export under the 
License Exception ENC Technology
+Software Unrestricted (TSU) exception (see the BIS Export Administration 
Regulations, Section 740.13) for
+both object code and source code.
+
+The following provides more details on the included cryptographic software:
+
+This software uses Apache Commons Crypto 
(https://commons.apache.org/proper/commons-crypto/) to
+support authentication, and encryption and decryption of data sent across the 
network between
+services.
+
+This software includes Bouncy Castle (http://bouncycastle.org/) to support the 
jets3t library.

http://git-wip-us.apache.org/repos/asf/spark/blob/91cdab51/NOTICE-binary
--
diff --git a/NOTICE-binary b/NOTICE-binary
index d56f99b..3155c38 100644
--- a/NOTICE-binary
+++ b/NOTICE-binary
@@ -5,6 +5,31 @@ This product includes software developed at
 The Apache Software Foundation (http://www.apache.org/).
 
 
+Export Control Notice
+-
+
+This distribution includes cryptographic software. The country in which you 
currently reside may have
+restrictions on the import, possession, use, and/or re-export to another 
country, of encryption software.
+BEFORE using any encryption software, please check your country's laws, 
regulations and policies concerning
+the import, possession, or use, and re-export of encryption software, to see 
if this is permitted. See
+<http://www.wassenaar.org/> for more information.
+
+The U.S. Government Department of Commerce, Bureau of Industry and Security 
(BIS), has classified this
+software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes 
information security software
+using or performing cryptographic functions with asymmetric algorithms. The 
form and manner of this Apache
+Software Foundation distribution makes it eligible for export under the 
License Exception ENC Technology
+Software Unrestricted (TSU) exception (see the BIS Export Administration 
Regulations, Section 740.13) for
+both object code and source code.
+
+The following provides more details on the included cryptographic software:
+
+This software uses Apache Commons Crypto 
(https://commons.apache.org/proper/commons-crypto/) to
+support authentication, and encryption and decryption of data se

spark git commit: [MINOR][BUILD] Add ECCN notice required by http://www.apache.org/dev/crypto.html

2018-08-10 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 e66f3f9b1 -> 7306ac71d


[MINOR][BUILD] Add ECCN notice required by http://www.apache.org/dev/crypto.html

Add ECCN notice required by http://www.apache.org/dev/crypto.html
See https://issues.apache.org/jira/browse/LEGAL-398

This should probably be backported to 2.3, 2.2, as that's when the key dep 
(commons crypto) turned up. BC is actually unused, but still there.

N/A

Closes #22064 from srowen/ECCN.

Authored-by: Sean Owen 
Signed-off-by: Sean Owen 
(cherry picked from commit 91cdab51ccb3a4e3b6d76132d00f3da30598735b)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7306ac71
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7306ac71
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7306ac71

Branch: refs/heads/branch-2.3
Commit: 7306ac71da0e31fa9655c5838dc7fcb6e4c0b7a0
Parents: e66f3f9
Author: Sean Owen 
Authored: Fri Aug 10 11:15:36 2018 -0500
Committer: Sean Owen 
Committed: Fri Aug 10 11:18:40 2018 -0500

--
 NOTICE | 25 +
 1 file changed, 25 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7306ac71/NOTICE
--
diff --git a/NOTICE b/NOTICE
index 6ec240e..876d606 100644
--- a/NOTICE
+++ b/NOTICE
@@ -5,6 +5,31 @@ This product includes software developed at
 The Apache Software Foundation (http://www.apache.org/).
 
 
+Export Control Notice
+-
+
+This distribution includes cryptographic software. The country in which you 
currently reside may have
+restrictions on the import, possession, use, and/or re-export to another 
country, of encryption software.
+BEFORE using any encryption software, please check your country's laws, 
regulations and policies concerning
+the import, possession, or use, and re-export of encryption software, to see 
if this is permitted. See
+<http://www.wassenaar.org/> for more information.
+
+The U.S. Government Department of Commerce, Bureau of Industry and Security 
(BIS), has classified this
+software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes 
information security software
+using or performing cryptographic functions with asymmetric algorithms. The 
form and manner of this Apache
+Software Foundation distribution makes it eligible for export under the 
License Exception ENC Technology
+Software Unrestricted (TSU) exception (see the BIS Export Administration 
Regulations, Section 740.13) for
+both object code and source code.
+
+The following provides more details on the included cryptographic software:
+
+This software uses Apache Commons Crypto 
(https://commons.apache.org/proper/commons-crypto/) to
+support authentication, and encryption and decryption of data sent across the 
network between
+services.
+
+This software includes Bouncy Castle (http://bouncycastle.org/) to support the 
jets3t library.
+
+
 
 Common Development and Distribution License 1.0
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [MINOR][BUILD] Add ECCN notice required by http://www.apache.org/dev/crypto.html

2018-08-10 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 b283c1f05 -> 051ea3a62


[MINOR][BUILD] Add ECCN notice required by http://www.apache.org/dev/crypto.html

Add ECCN notice required by http://www.apache.org/dev/crypto.html
See https://issues.apache.org/jira/browse/LEGAL-398

This should probably be backported to 2.3, 2.2, as that's when the key dep 
(commons crypto) turned up. BC is actually unused, but still there.

N/A

Closes #22064 from srowen/ECCN.

Authored-by: Sean Owen 
Signed-off-by: Sean Owen 
(cherry picked from commit 91cdab51ccb3a4e3b6d76132d00f3da30598735b)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/051ea3a6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/051ea3a6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/051ea3a6

Branch: refs/heads/branch-2.2
Commit: 051ea3a6217fa1038e930906c58d8e86e9626e35
Parents: b283c1f
Author: Sean Owen 
Authored: Fri Aug 10 11:15:36 2018 -0500
Committer: Sean Owen 
Committed: Fri Aug 10 11:19:51 2018 -0500

--
 NOTICE | 25 +
 1 file changed, 25 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/051ea3a6/NOTICE
--
diff --git a/NOTICE b/NOTICE
index f4b64b5..737189a 100644
--- a/NOTICE
+++ b/NOTICE
@@ -5,6 +5,31 @@ This product includes software developed at
 The Apache Software Foundation (http://www.apache.org/).
 
 
+Export Control Notice
+-
+
+This distribution includes cryptographic software. The country in which you 
currently reside may have
+restrictions on the import, possession, use, and/or re-export to another 
country, of encryption software.
+BEFORE using any encryption software, please check your country's laws, 
regulations and policies concerning
+the import, possession, or use, and re-export of encryption software, to see 
if this is permitted. See
+<http://www.wassenaar.org/> for more information.
+
+The U.S. Government Department of Commerce, Bureau of Industry and Security 
(BIS), has classified this
+software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes 
information security software
+using or performing cryptographic functions with asymmetric algorithms. The 
form and manner of this Apache
+Software Foundation distribution makes it eligible for export under the 
License Exception ENC Technology
+Software Unrestricted (TSU) exception (see the BIS Export Administration 
Regulations, Section 740.13) for
+both object code and source code.
+
+The following provides more details on the included cryptographic software:
+
+This software uses Apache Commons Crypto 
(https://commons.apache.org/proper/commons-crypto/) to
+support authentication, and encryption and decryption of data sent across the 
network between
+services.
+
+This software includes Bouncy Castle (http://bouncycastle.org/) to support the 
jets3t library.
+
+
 
 Common Development and Distribution License 1.0
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-24908][R][STYLE] removing spaces to make lintr happy

2018-08-10 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 04c652064 -> a0a7e41cf


[SPARK-24908][R][STYLE] removing spaces to make lintr happy

## What changes were proposed in this pull request?

During my travails in porting Spark builds to run on our CentOS worker, I managed to recreate (as best I could) the CentOS environment on our new Ubuntu testing machine.

While running my initial builds, lintr was crashing on some extraneous spaces in test_basic.R (see:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)

After removing those spaces, the Ubuntu build happily passed the lintr tests.

## How was this patch tested?

I then tested this against a modified spark-master-test-sbt-hadoop-2.6 build (see
https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),
which scp'ed a copy of test_basic.R into the repo after the git clone. Everything seems to be working happily.

Author: shane knapp 

Closes #21864 from shaneknapp/fixing-R-lint-spacing.

(cherry picked from commit 3efdf35327be38115b04b08e9c8d0aa282a904ab)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a0a7e41c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a0a7e41c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a0a7e41c

Branch: refs/heads/branch-2.3
Commit: a0a7e41cfbcebf1eb0228b4acfdb0381c8eeb79f
Parents: 04c6520
Author: shane knapp 
Authored: Tue Jul 24 16:13:57 2018 -0700
Committer: Sean Owen 
Committed: Fri Aug 10 14:52:04 2018 -0500

--
 R/pkg/inst/tests/testthat/test_basic.R | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a0a7e41c/R/pkg/inst/tests/testthat/test_basic.R
--
diff --git a/R/pkg/inst/tests/testthat/test_basic.R 
b/R/pkg/inst/tests/testthat/test_basic.R
index 243f5f0..80df3d8 100644
--- a/R/pkg/inst/tests/testthat/test_basic.R
+++ b/R/pkg/inst/tests/testthat/test_basic.R
@@ -18,9 +18,9 @@
 context("basic tests for CRAN")
 
 test_that("create DataFrame from list or data.frame", {
-  tryCatch( checkJavaVersion(),
+  tryCatch(checkJavaVersion(),
 error = function(e) { skip("error on Java check") },
-warning = function(e) { skip("warning on Java check") } )
+warning = function(e) { skip("warning on Java check") })
 
   sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE,
  sparkConfig = sparkRTestConfig)
@@ -54,9 +54,9 @@ test_that("create DataFrame from list or data.frame", {
 })
 
 test_that("spark.glm and predict", {
-  tryCatch( checkJavaVersion(),
+  tryCatch(checkJavaVersion(),
 error = function(e) { skip("error on Java check") },
-warning = function(e) { skip("warning on Java check") } )
+warning = function(e) { skip("warning on Java check") })
 
   sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE,
  sparkConfig = sparkRTestConfig)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-25089][R] removing lintr checks for 2.1

2018-08-10 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 42229430f -> 09f70f5fd


[SPARK-25089][R] removing lintr checks for 2.1

## What changes were proposed in this pull request?

Since 2.1 will be EOLed some time in the not-too-distant future, and we'll be moving the builds from CentOS to Ubuntu, I think it's fine to disable R linting rather than going down the rabbit hole of trying to fix this.

## How was this patch tested?

The build system will test this.

Closes #22073 from shaneknapp/removing-lintr.

Authored-by: shane knapp 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/09f70f5f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/09f70f5f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/09f70f5f

Branch: refs/heads/branch-2.1
Commit: 09f70f5fd681dd3f38b7d9e6514f1fd63703e7f1
Parents: 4222943
Author: shane knapp 
Authored: Fri Aug 10 18:06:54 2018 -0500
Committer: Sean Owen 
Committed: Fri Aug 10 18:06:54 2018 -0500

--
 dev/run-tests.py | 14 --
 1 file changed, 14 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/09f70f5f/dev/run-tests.py
--
diff --git a/dev/run-tests.py b/dev/run-tests.py
index f24aac9..8ff8f51 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -212,18 +212,6 @@ def run_python_style_checks():
 run_cmd([os.path.join(SPARK_HOME, "dev", "lint-python")])
 
 
-def run_sparkr_style_checks():
-set_title_and_block("Running R style checks", "BLOCK_R_STYLE")
-
-if which("R"):
-# R style check should be executed after `install-dev.sh`.
-# Since warnings about `no visible global function definition` appear
-# without the installation. SEE ALSO: SPARK-9121.
-run_cmd([os.path.join(SPARK_HOME, "dev", "lint-r")])
-else:
-print("Ignoring SparkR style check as R was not found in PATH")
-
-
 def build_spark_documentation():
 set_title_and_block("Building Spark Documentation", "BLOCK_DOCUMENTATION")
 os.environ["PRODUCTION"] = "1 jekyll build"
@@ -561,8 +549,6 @@ def main():
 pass
 if not changed_files or any(f.endswith(".py") for f in changed_files):
 run_python_style_checks()
-if not changed_files or any(f.endswith(".R") for f in changed_files):
-run_sparkr_style_checks()
 
 # determine if docs were changed and if we're inside the amplab environment
 # note - the below commented out until *all* Jenkins workers can get 
`jekyll` installed


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-25089][R] removing lintr checks for 2.0

2018-08-10 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 dccd8c754 -> 5ed89ceaf


[SPARK-25089][R] removing lintr checks for 2.0

## What changes were proposed in this pull request?

Since 2.0 will be EOLed some time in the not-too-distant future, and we'll be moving the builds from CentOS to Ubuntu, I think it's fine to disable R linting rather than going down the rabbit hole of trying to fix this.

## How was this patch tested?

The build system will test this.

Closes #22074 from shaneknapp/removing-lintr-2.0.

Authored-by: shane knapp 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5ed89cea
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5ed89cea
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5ed89cea

Branch: refs/heads/branch-2.0
Commit: 5ed89ceaf367590f79401abbf9ff7fc66507fe4e
Parents: dccd8c7
Author: shane knapp 
Authored: Fri Aug 10 18:07:18 2018 -0500
Committer: Sean Owen 
Committed: Fri Aug 10 18:07:18 2018 -0500

--
 dev/run-tests.py | 14 --
 1 file changed, 14 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5ed89cea/dev/run-tests.py
--
diff --git a/dev/run-tests.py b/dev/run-tests.py
index 43e3bf6..063a879 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -212,18 +212,6 @@ def run_python_style_checks():
 run_cmd([os.path.join(SPARK_HOME, "dev", "lint-python")])
 
 
-def run_sparkr_style_checks():
-set_title_and_block("Running R style checks", "BLOCK_R_STYLE")
-
-if which("R"):
-# R style check should be executed after `install-dev.sh`.
-# Since warnings about `no visible global function definition` appear
-# without the installation. SEE ALSO: SPARK-9121.
-run_cmd([os.path.join(SPARK_HOME, "dev", "lint-r")])
-else:
-print("Ignoring SparkR style check as R was not found in PATH")
-
-
 def build_spark_documentation():
 set_title_and_block("Building Spark Documentation", "BLOCK_DOCUMENTATION")
 os.environ["PRODUCTION"] = "1 jekyll build"
@@ -555,8 +543,6 @@ def main():
 pass
 if not changed_files or any(f.endswith(".py") for f in changed_files):
 run_python_style_checks()
-if not changed_files or any(f.endswith(".R") for f in changed_files):
-run_sparkr_style_checks()
 
 # determine if docs were changed and if we're inside the amplab environment
 # note - the below commented out until *all* Jenkins workers can get 
`jekyll` installed


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[1/2] spark git commit: Fix typos detected by github.com/client9/misspell

2018-08-11 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 4855d5c4b -> 8ec25cd67


http://git-wip-us.apache.org/repos/asf/spark/blob/8ec25cd6/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q
--
diff --git 
a/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q
 
b/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q
index f53295e..69d671a 100644
--- 
a/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q
+++ 
b/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q
@@ -12,7 +12,7 @@ LOAD DATA LOCAL INPATH '../../data/files/T1.txt' INTO TABLE 
T1 PARTITION (ds='1'
 INSERT OVERWRITE TABLE T1 PARTITION (ds='1') select key, val from T1 where ds 
= '1';
 
 -- The plan is not converted to a map-side, since although the sorting columns 
and grouping
--- columns match, the user is issueing a distinct.
+-- columns match, the user is issuing a distinct.
 -- However, after HIVE-4310, partial aggregation is performed on the mapper
 EXPLAIN
 select count(distinct key) from T1;

http://git-wip-us.apache.org/repos/asf/spark/blob/8ec25cd6/sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala
index 5339799..b9ec940 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala
@@ -666,7 +666,7 @@ abstract class HadoopFsRelationTest extends QueryTest with 
SQLTestUtils with Tes
 assert(expectedResult.isRight, s"Was not expecting error with 
$path: " + e)
 assert(
   e.getMessage.contains(expectedResult.right.get),
-  s"Did not find expected error message wiht $path")
+  s"Did not find expected error message with $path")
 }
   }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[2/2] spark git commit: Fix typos detected by github.com/client9/misspell

2018-08-11 Thread srowen
Fix typos detected by github.com/client9/misspell

## What changes were proposed in this pull request?

Fixing typos is sometimes very hard. It's not so easy to visually review them. 
Recently, I discovered a very useful tool for it, 
[misspell](https://github.com/client9/misspell).

This pull request fixes minor typos detected by 
[misspell](https://github.com/client9/misspell) except for the false positives. 
If you would like me to work on other files as well, let me know.
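
As a rough, hedged illustration of how the same check could be scripted in a build (this is not part of the pull request; it assumes the misspell binary from github.com/client9/misspell is on the PATH and simply wraps the `misspell . | grep -v '.js'` invocation shown below):

```
# Illustrative sketch only: run misspell over the repo and fail if it reports
# anything outside .js files, mirroring the manual check in this PR description.
import subprocess
import sys

def count_misspellings(root="."):
    # misspell prints one "file:line:col: ... is a misspelling of ..." line per finding.
    result = subprocess.run(["misspell", root], capture_output=True, text=True)
    findings = [line for line in result.stdout.splitlines() if ".js" not in line]
    for line in findings:
        print(line)
    return len(findings)

if __name__ == "__main__":
    sys.exit(1 if count_misspellings() else 0)
```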

## How was this patch tested?

### before

```
$ misspell . | grep -v '.js'
R/pkg/R/SQLContext.R:354:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:424:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:445:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:495:43: "definiton" is a misspelling of "definition"
NOTICE-binary:454:16: "containd" is a misspelling of "contained"
R/pkg/R/context.R:46:43: "definiton" is a misspelling of "definition"
R/pkg/R/context.R:74:43: "definiton" is a misspelling of "definition"
R/pkg/R/DataFrame.R:591:48: "persistance" is a misspelling of "persistence"
R/pkg/R/streaming.R:166:44: "occured" is a misspelling of "occurred"
R/pkg/inst/worker/worker.R:65:22: "ouput" is a misspelling of "output"
R/pkg/tests/fulltests/test_utils.R:106:25: "environemnt" is a misspelling of 
"environment"
common/kvstore/src/test/java/org/apache/spark/util/kvstore/InMemoryStoreSuite.java:38:39:
 "existant" is a misspelling of "existent"
common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java:83:39:
 "existant" is a misspelling of "existent"
common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java:243:46:
 "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:234:19:
 "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:238:63:
 "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:244:46:
 "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:276:39:
 "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20:
 "transfered" is a misspelling of "transferred"
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala:195:15:
 "orgin" is a misspelling of "origin"
core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:621:39: 
"gauranteed" is a misspelling of "guaranteed"
core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a 
misspelling of "etc"
core/src/main/scala/org/apache/spark/storage/DiskStore.scala:282:18: 
"transfered" is a misspelling of "transferred"
core/src/main/scala/org/apache/spark/util/ListenerBus.scala:64:17: "overriden" 
is a misspelling of "overridden"
core/src/test/scala/org/apache/spark/ShuffleSuite.scala:211:7: "substracted" is 
a misspelling of "subtracted"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: 
"agriculteur" is a misspelling of "agriculture"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:2468:84: 
"truely" is a misspelling of "truly"
core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:25:18: 
"persistance" is a misspelling of "persistence"
core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:26:69: 
"persistance" is a misspelling of "persistence"
data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous"
dev/run-pip-tests:55:28: "enviroments" is a misspelling of "environments"
dev/run-pip-tests:91:37: "virutal" is a misspelling of "virtual"
dev/merge_spark_pr.py:377:72: "accross" is a misspelling of "across"
dev/merge_spark_pr.py:378:66: "accross" is a misspelling of "across"
dev/run-pip-tests:126:25: "enviroments" is a misspelling of "environments"
docs/configuration.md:1830:82: "overriden" is a misspelling of "overridden"
docs/structured-streaming-programming-guide.md:525:45: "processs" is a 
misspelling of "processes"
docs/structured-streaming-programming-guide.md:1165:61: "BETWEN" is a 
misspelling of "BETWEEN"
docs/sql-programming-guide.md:1891:810: "behaivor" is a misspelling of 
"behavior"
examples/src/main/python/sql/arrow.py:98:8: "substract" is a misspelling of 
"subtract"
examples/src/main/python/sql/arrow.py:103:27: "substract" is a misspelling of 
"subtract"
licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of 
"Mathematics"
licenses/LICENSE-heapq.t

spark git commit: Fix typos

2018-08-12 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master d17723479 -> 5bc7598b2


Fix typos

## What changes were proposed in this pull request?

Small typo fixes in PySpark. These were the only ones I stumbled across after looking around for a while.

## How was this patch tested?

Manually

Please review http://spark.apache.org/contributing.html before opening a pull 
request.

Closes #22016 from tynan-cr/typo-fix-pyspark.

Authored-by: Tynan CR 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5bc7598b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5bc7598b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5bc7598b

Branch: refs/heads/master
Commit: 5bc7598b25dbf5ea4b3e0f149aa31fb03a5310f9
Parents: d177234
Author: Tynan CR 
Authored: Sun Aug 12 08:13:09 2018 -0500
Committer: Sean Owen 
Committed: Sun Aug 12 08:13:09 2018 -0500

--
 python/pyspark/context.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5bc7598b/python/pyspark/context.py
--
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 40208ec..b77fa0e 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -126,7 +126,7 @@ class SparkContext(object):
 self.environment = environment or {}
 # java gateway must have been launched at this point.
 if conf is not None and conf._jconf is not None:
-# conf has been initialized in JVM properly, so use conf directly. 
This represent the
+# conf has been initialized in JVM properly, so use conf directly. 
This represents the
 # scenario that JVM has been launched before SparkConf is created 
(e.g. SparkContext is
 # created and then stopped, and we create a new SparkConf and new 
SparkContext again)
 self._conf = conf


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [MINOR][DOC] Fix Java example code in Column's comments

2018-08-12 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 5bc7598b2 -> a90b1f5d9


[MINOR][DOC] Fix Java example code in Column's comments

## What changes were proposed in this pull request?
Fix scaladoc in Column

## How was this patch tested?
None

Closes #22069 from sadhen/fix_doc_minor.

Authored-by: 忍冬 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a90b1f5d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a90b1f5d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a90b1f5d

Branch: refs/heads/master
Commit: a90b1f5d93d2eccca46c9c525c03a13ae55fd967
Parents: 5bc7598
Author: 忍冬 
Authored: Sun Aug 12 08:26:21 2018 -0500
Committer: Sean Owen 
Committed: Sun Aug 12 08:26:21 2018 -0500

--
 .../scala/org/apache/spark/sql/Column.scala | 40 ++--
 1 file changed, 20 insertions(+), 20 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a90b1f5d/sql/core/src/main/scala/org/apache/spark/sql/Column.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Column.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Column.scala
index 4eee3de..ae27690 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Column.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Column.scala
@@ -345,7 +345,7 @@ class Column(val expr: Expression) extends Logging {
*
*   // Java:
*   import static org.apache.spark.sql.functions.*;
-   *   people.select( people("age").gt(21) );
+   *   people.select( people.col("age").gt(21) );
* }}}
*
* @group expr_ops
@@ -361,7 +361,7 @@ class Column(val expr: Expression) extends Logging {
*
*   // Java:
*   import static org.apache.spark.sql.functions.*;
-   *   people.select( people("age").gt(21) );
+   *   people.select( people.col("age").gt(21) );
* }}}
*
* @group java_expr_ops
@@ -376,7 +376,7 @@ class Column(val expr: Expression) extends Logging {
*   people.select( people("age") < 21 )
*
*   // Java:
-   *   people.select( people("age").lt(21) );
+   *   people.select( people.col("age").lt(21) );
* }}}
*
* @group expr_ops
@@ -391,7 +391,7 @@ class Column(val expr: Expression) extends Logging {
*   people.select( people("age") < 21 )
*
*   // Java:
-   *   people.select( people("age").lt(21) );
+   *   people.select( people.col("age").lt(21) );
* }}}
*
* @group java_expr_ops
@@ -406,7 +406,7 @@ class Column(val expr: Expression) extends Logging {
*   people.select( people("age") <= 21 )
*
*   // Java:
-   *   people.select( people("age").leq(21) );
+   *   people.select( people.col("age").leq(21) );
* }}}
*
* @group expr_ops
@@ -421,7 +421,7 @@ class Column(val expr: Expression) extends Logging {
*   people.select( people("age") <= 21 )
*
*   // Java:
-   *   people.select( people("age").leq(21) );
+   *   people.select( people.col("age").leq(21) );
* }}}
*
* @group java_expr_ops
@@ -436,7 +436,7 @@ class Column(val expr: Expression) extends Logging {
*   people.select( people("age") >= 21 )
*
*   // Java:
-   *   people.select( people("age").geq(21) )
+   *   people.select( people.col("age").geq(21) )
* }}}
*
* @group expr_ops
@@ -451,7 +451,7 @@ class Column(val expr: Expression) extends Logging {
*   people.select( people("age") >= 21 )
*
*   // Java:
-   *   people.select( people("age").geq(21) )
+   *   people.select( people.col("age").geq(21) )
* }}}
*
* @group java_expr_ops
@@ -588,7 +588,7 @@ class Column(val expr: Expression) extends Logging {
*   people.filter( people("inSchool") || people("isEmployed") )
*
*   // Java:
-   *   people.filter( people("inSchool").or(people("isEmployed")) );
+   *   people.filter( people.col("inSchool").or(people.col("isEmployed")) );
* }}}
*
* @group expr_ops
@@ -603,7 +603,7 @@ class Column(val expr: Expression) extends Logging {
*   people.filter( people("inSchool") || people("isEmployed") )
*
*   // Java:
-   *   people.filter( people("inSchool").or(people("isEmployed")) );
+   *   people.filter( people.col("inSchool").or(people.col("isEmployed")) );
* }}}
*
* @group java_expr_ops
@@ -618,7 +618,7 @@ class Column(val expr: Expression) extends Logging {
*   people.select( people("inSchool") && people("isEmployed") )
*
*   // Java:
-   *   people.select( people("inSchool").and(people("isEmployed")) );
+   *   people.select( people.col("inSchool").and(people.col("isEmployed")) );
* }}}
*
* @group expr_ops
@@ -633,7 +633,7 @@ class Column(val expr: Expression) extends Logging {
*   people.select( people("inSchool") && pe
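
For readers more familiar with PySpark, here is a hedged parallel of the corrected Scala/Java snippets above (not part of this commit; it assumes a DataFrame named `people` with `age`, `inSchool`, and `isEmployed` columns):

```
# Column expressions in PySpark, analogous to people.col("age").gt(21) in Java.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
people = spark.createDataFrame(
    [(25, True, True), (18, True, False)],
    ["age", "inSchool", "isEmployed"])

adults = people.filter(people["age"] > 21)
working_students = people.filter(people["inSchool"] & people["isEmployed"])
adults.show()
```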

spark-website git commit: Update configuration.html

2018-08-12 Thread srowen
Repository: spark-website
Updated Branches:
  refs/heads/asf-site 03f5adcb8 -> 121b56f99


Update configuration.html

Spelling mistake in the mention of spark-defaults.conf.

Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/121b56f9
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/121b56f9
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/121b56f9

Branch: refs/heads/asf-site
Commit: 121b56f9957b3e858bcba4bde0936b5e7c0661fb
Parents: 03f5adc
Author: Joey Krabacher 
Authored: Fri Jul 27 15:16:27 2018 -0500
Committer: Sean Owen 
Committed: Sun Aug 12 19:47:48 2018 -0500

--
 site/docs/2.3.1/configuration.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/121b56f9/site/docs/2.3.1/configuration.html
--
diff --git a/site/docs/2.3.1/configuration.html 
b/site/docs/2.3.1/configuration.html
index cf17299..c8e44d6 100644
--- a/site/docs/2.3.1/configuration.html
+++ b/site/docs/2.3.1/configuration.html
@@ -2632,7 +2632,7 @@ Spark’s classpath for each application. In a Spark 
cluster running on YARN
 files are set cluster-wide, and cannot safely be changed by the 
application.
 
 The better choice is to use spark hadoop properties in the form of 
spark.hadoop.*. 
-They can be considered as same as normal spark properties which can be set in 
$SPARK_HOME/conf/spark-defalut.conf
+They can be considered as same as normal spark properties which can be set in 
$SPARK_HOME/conf/spark-defaults.conf
 
 In some cases, you may want to avoid hard-coding certain configurations in 
a SparkConf. For
 instance, Spark allows you to simply create an empty conf and set spark/spark 
hadoop properties.
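
To make the quoted guidance concrete, here is a minimal, hedged sketch (not part of the commit; the Hadoop property name is only an example of the spark.hadoop.* form):

```
# spark.hadoop.* properties behave like normal Spark properties and can also be
# placed in $SPARK_HOME/conf/spark-defaults.conf; the property below is illustrative.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().set("spark.hadoop.fs.s3a.connection.maximum", "100")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(spark.conf.get("spark.hadoop.fs.s3a.connection.maximum"))
```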


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark-website git commit: Add Alluxio back to replace Tachyon as one possible data source

2018-08-12 Thread srowen
Repository: spark-website
Updated Branches:
  refs/heads/asf-site 121b56f99 -> dc13b5fe4


Add Alluxio back to replace Tachyon as one possible data source


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/dc13b5fe
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/dc13b5fe
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/dc13b5fe

Branch: refs/heads/asf-site
Commit: dc13b5fe46ea2977b287675be211929083f0b040
Parents: 121b56f
Author: Bin Fan 
Authored: Wed Aug 8 13:31:35 2018 -0700
Committer: Sean Owen 
Committed: Sun Aug 12 20:01:59 2018 -0500

--
 index.md| 1 +
 site/index.html | 1 +
 2 files changed, 2 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/dc13b5fe/index.md
--
diff --git a/index.md b/index.md
index a548dad..e3a9557 100644
--- a/index.md
+++ b/index.md
@@ -115,6 +115,7 @@ df.where(
   on https://mesos.apache.org";>Mesos, or 
   on https://kubernetes.io/";>Kubernetes.
   Access data in https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html";>HDFS,
 
+  https://alluxio.org";>Alluxio,
   https://cassandra.apache.org";>Apache Cassandra, 
   https://hbase.apache.org";>Apache HBase,
   https://hive.apache.org";>Apache Hive, 

http://git-wip-us.apache.org/repos/asf/spark-website/blob/dc13b5fe/site/index.html
--
diff --git a/site/index.html b/site/index.html
index 5a98ddb..9c666b4 100644
--- a/site/index.html
+++ b/site/index.html
@@ -302,6 +302,7 @@ df.where(
   on https://mesos.apache.org";>Mesos, or 
   on https://kubernetes.io/";>Kubernetes.
   Access data in https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html";>HDFS,
 
+  https://alluxio.org";>Alluxio,
   https://cassandra.apache.org";>Apache Cassandra, 
   https://hbase.apache.org";>Apache HBase,
   https://hive.apache.org";>Apache Hive, 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark-website git commit: Update my affiliation

2018-08-12 Thread srowen
Repository: spark-website
Updated Branches:
  refs/heads/asf-site dc13b5fe4 -> a63b5f427


Update my affiliation


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/a63b5f42
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/a63b5f42
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/a63b5f42

Branch: refs/heads/asf-site
Commit: a63b5f4279a6d5be588a8517c6d078d12d9aeacf
Parents: dc13b5f
Author: Sean Owen 
Authored: Sun Aug 12 21:17:40 2018 -0500
Committer: Sean Owen 
Committed: Sun Aug 12 21:17:40 2018 -0500

--
 committers.md| 2 +-
 site/committers.html | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/a63b5f42/committers.md
--
diff --git a/committers.md b/committers.md
index eef9f2c..9bfe5f8 100644
--- a/committers.md
+++ b/committers.md
@@ -48,7 +48,7 @@ navigation:
 |Mridul Muralidharam|Hortonworks|
 |Andrew Or|Princeton University|
 |Kay Ousterhout|LightStep|
-|Sean Owen|unaffiliated|
+|Sean Owen|Databricks|
 |Tejas Patil|Facebook|
 |Nick Pentreath|IBM|
 |Anirudh Ramanathan|Google|

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a63b5f42/site/committers.html
--
diff --git a/site/committers.html b/site/committers.html
index ff67743..2fbba19 100644
--- a/site/committers.html
+++ b/site/committers.html
@@ -364,7 +364,7 @@
 
 
   Sean Owen
-  unaffiliated
+  Databricks
 
 
   Tejas Patil


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark-website git commit: Add CVE-2018-11770

2018-08-13 Thread srowen
Repository: spark-website
Updated Branches:
  refs/heads/asf-site a63b5f427 -> e33a4bb7d


Add CVE-2018-11770


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/e33a4bb7
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/e33a4bb7
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/e33a4bb7

Branch: refs/heads/asf-site
Commit: e33a4bb7d8bbc25bb6a7d96c8bd6c13e3b05e77b
Parents: a63b5f4
Author: Sean Owen 
Authored: Mon Aug 13 09:25:05 2018 -0500
Committer: Sean Owen 
Committed: Mon Aug 13 09:25:05 2018 -0500

--
 security.md| 62 +--
 site/security.html | 99 +++--
 2 files changed, 138 insertions(+), 23 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/e33a4bb7/security.md
--
diff --git a/security.md b/security.md
index f99b9bd..19231f6 100644
--- a/security.md
+++ b/security.md
@@ -10,15 +10,55 @@ navigation:
 Reporting Security Issues
 
 Apache Spark uses the standard process outlined by the [Apache Security 
Team](https://www.apache.org/security/)
-for reporting vulnerabilities.
+for reporting vulnerabilities. Note that vulnerabilities should not be 
publicly disclosed until the project has
+responded.
 
 To report a possible security vulnerability, please email 
`secur...@apache.org`. This is a
 non-public list that will reach the Apache Security team, as well as the Spark 
PMC.
 
 Known Security Issues
 
+CVE-2018-11770: Apache Spark standalone master, Mesos 
REST APIs not controlled by authentication
+
+Severity: Medium
+
+Vendor: The Apache Software Foundation
+
+Versions Affected:
+
+- Spark versions from 1.3.0, running standalone master with REST API enabled, 
or running Mesos master with cluster mode enabled
+
+Description:
+
+From version 1.3.0 onward, Spark's standalone master exposes a REST API for 
job submission, in addition 
+to the submission mechanism used by `spark-submit`. In standalone, the config 
property 
+`spark.authenticate.secret` establishes a shared secret for authenticating 
requests to submit jobs via 
+`spark-submit`. However, the REST API does not use this or any other 
authentication mechanism, and this is 
+not adequately documented. In this case, a user would be able to run a driver 
program without authenticating, 
+but not launch executors, using the REST API. This REST API is also used by 
Mesos, when set up to run in 
+cluster mode (i.e., when also running `MesosClusterDispatcher`), for job 
submission. Future versions of Spark 
+will improve documentation on these points, and prohibit setting 
`spark.authenticate.secret` when running 
+the REST APIs, to make this clear. Future versions will also disable the REST 
API by default in the 
+standalone master by changing the default value of `spark.master.rest.enabled` 
to `false`.
+
+Mitigation:
+
+For standalone masters, disable the REST API by setting 
`spark.master.rest.enabled` to `false` if it is unused, 
+and/or ensure that all network access to the REST API (port 6066 by default) 
is restricted to hosts that are 
+trusted to submit jobs. Mesos users can stop the `MesosClusterDispatcher`, 
though that will prevent them 
+from running jobs in cluster mode. Alternatively, they can ensure access to 
the `MesosRestSubmissionServer` 
+(port 7077 by default) is restricted to trusted hosts.
+
+Credit:
+
+- Imran Rashid, Cloudera
+- Fengwei Zhang, Alibaba Cloud Security Team
+
+
 CVE-2018-8024: Apache Spark XSS vulnerability in UI
 
+Severity: Medium
+
 Versions Affected:
 
 - Spark versions through 2.1.2
@@ -26,6 +66,7 @@ Versions Affected:
 - Spark 2.3.0
 
 Description:
+
 In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, it's 
possible for a malicious 
 user to construct a URL pointing to a Spark cluster's UI's job and stage info 
pages, and if a user can 
 be tricked into accessing the URL, can be used to cause script to execute and 
expose information from 
@@ -55,6 +96,7 @@ Versions affected:
 - Spark 2.3.0
 
 Description:
+
 In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, when 
using PySpark or SparkR, 
 it's possible for a different local user to connect to the Spark application 
and impersonate the 
 user running the Spark application.
@@ -79,9 +121,11 @@ Severity: Medium
 Vendor: The Apache Software Foundation
 
 Versions Affected:
-Versions of Apache Spark from 1.6.0 until 2.1.1
+
+- Versions of Apache Spark from 1.6.0 until 2.1.1
 
 Description:
+
 In Apache Spark 1.6.0 until 2.1.1, the launcher API performs unsafe
 deserialization of data received by  its socket. This makes applications
 launched programmatically using the launcher API potentially
@@ -92,6 +13
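
As a hedged complement to the CVE-2018-11770 mitigation quoted above (not part of the commit), a small check like the following can confirm whether the standalone REST submission port, 6066 by default, is reachable from a given machine; the hostname is a placeholder:

```
# Illustrative only: probe whether the REST submission port is exposed.
import socket

def port_open(host, port=6066, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("REST port reachable:", port_open("spark-master.example.com"))
```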

spark-website git commit: Stash pride logo for next year

2018-08-13 Thread srowen
Repository: spark-website
Updated Branches:
  refs/heads/asf-site e33a4bb7d -> 8eb764260


Stash pride logo for next year


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/8eb76426
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/8eb76426
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/8eb76426

Branch: refs/heads/asf-site
Commit: 8eb764260f5308960c69c212c642cd19ededf3ed
Parents: e33a4bb
Author: Sean Owen 
Authored: Sat Aug 11 21:35:01 2018 -0500
Committer: Sean Owen 
Committed: Mon Aug 13 20:12:03 2018 -0500

--
 images/spark-logo-trademark.png  | Bin 49720 -> 26999 bytes
 images/spark-logo.png| Bin 49720 -> 26999 bytes
 site/images/spark-logo-trademark.png | Bin 49720 -> 26999 bytes
 site/images/spark-logo.png   | Bin 49720 -> 26999 bytes
 4 files changed, 0 insertions(+), 0 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/8eb76426/images/spark-logo-trademark.png
--
diff --git a/images/spark-logo-trademark.png b/images/spark-logo-trademark.png
index eab639f..16702a9 100644
Binary files a/images/spark-logo-trademark.png and 
b/images/spark-logo-trademark.png differ

http://git-wip-us.apache.org/repos/asf/spark-website/blob/8eb76426/images/spark-logo.png
--
diff --git a/images/spark-logo.png b/images/spark-logo.png
index eab639f..16702a9 100644
Binary files a/images/spark-logo.png and b/images/spark-logo.png differ

http://git-wip-us.apache.org/repos/asf/spark-website/blob/8eb76426/site/images/spark-logo-trademark.png
--
diff --git a/site/images/spark-logo-trademark.png 
b/site/images/spark-logo-trademark.png
index eab639f..16702a9 100644
Binary files a/site/images/spark-logo-trademark.png and 
b/site/images/spark-logo-trademark.png differ

http://git-wip-us.apache.org/repos/asf/spark-website/blob/8eb76426/site/images/spark-logo.png
--
diff --git a/site/images/spark-logo.png b/site/images/spark-logo.png
index eab639f..16702a9 100644
Binary files a/site/images/spark-logo.png and b/site/images/spark-logo.png 
differ


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-25088][CORE][MESOS][DOCS] Update Rest Server docs & defaults.

2018-08-14 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 80784a1de -> 102487584


[SPARK-25088][CORE][MESOS][DOCS] Update Rest Server docs & defaults.

## What changes were proposed in this pull request?

(a) Disable the REST submission server by default in standalone mode.
(b) Fail the standalone master if the REST server is enabled and an authentication secret is set.
(c) Fail the Mesos cluster dispatcher if an authentication secret is set.
(d) Update the docs.
(e) When submitting a standalone app, only try REST submission first if spark.master.rest.enabled=true.

Otherwise, you'd see a 10-second pause like this (a sketch of the new gating follows the log excerpt):
18/08/09 08:13:22 INFO RestSubmissionClient: Submitting a request to launch an 
application in spark://...
18/08/09 08:13:33 WARN RestSubmissionClient: Unable to connect to server 
spark://...
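
A minimal sketch of the new opt-in gating, paraphrased in Python rather than the patched Scala (illustrative only; `spark_properties` stands in for the parsed spark-defaults/--conf properties):

```
def use_rest_submission(spark_properties):
    # REST submission is now opt-in; without the property, fall back to the legacy gateway.
    return spark_properties.get("spark.master.rest.enabled", "false").lower() == "true"

assert use_rest_submission({}) is False  # new default: no REST attempt, no 10-second pause
assert use_rest_submission({"spark.master.rest.enabled": "true"}) is True
```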

I also made sure the mesos cluster dispatcher failed with the secret enabled, 
though I had to do that on slightly different code as I don't have mesos native 
libs around.

## How was this patch tested?

I ran the tests in the mesos module & in core for org.apache.spark.deploy.*

I ran a test on a cluster with standalone master to make sure I could still 
start with the right configs, and would fail the right way too.

Closes #22071 from squito/rest_doc_updates.

Authored-by: Imran Rashid 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/10248758
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/10248758
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/10248758

Branch: refs/heads/master
Commit: 10248758438b9ff57f5669a324a716c8c6c8f17b
Parents: 80784a1
Author: Imran Rashid 
Authored: Tue Aug 14 13:02:33 2018 -0500
Committer: Sean Owen 
Committed: Tue Aug 14 13:02:33 2018 -0500

--
 .../org/apache/spark/deploy/SparkSubmitArguments.scala|  4 +++-
 .../scala/org/apache/spark/deploy/master/Master.scala | 10 +-
 .../apache/spark/deploy/rest/RestSubmissionServer.scala   |  1 +
 docs/running-on-mesos.md  |  2 ++
 docs/security.md  |  7 ++-
 .../spark/deploy/mesos/MesosClusterDispatcher.scala   |  8 
 6 files changed, 29 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/10248758/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala 
b/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
index fb23210..0998757 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
@@ -82,7 +82,7 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], 
env: Map[String, S
   var driverCores: String = null
   var submissionToKill: String = null
   var submissionToRequestStatusFor: String = null
-  var useRest: Boolean = true // used internally
+  var useRest: Boolean = false // used internally
 
   /** Default properties present in the currently defined defaults file. */
   lazy val defaultSparkProperties: HashMap[String, String] = {
@@ -115,6 +115,8 @@ private[deploy] class SparkSubmitArguments(args: 
Seq[String], env: Map[String, S
   // Use `sparkProperties` map along with env vars to fill in any missing 
parameters
   loadEnvironmentArguments()
 
+  useRest = sparkProperties.getOrElse("spark.master.rest.enabled", 
"false").toBoolean
+
   validateArguments()
 
   /**

http://git-wip-us.apache.org/repos/asf/spark/blob/10248758/core/src/main/scala/org/apache/spark/deploy/master/Master.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/master/Master.scala 
b/core/src/main/scala/org/apache/spark/deploy/master/Master.scala
index 2c78c15..e118424 100644
--- a/core/src/main/scala/org/apache/spark/deploy/master/Master.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/master/Master.scala
@@ -121,10 +121,18 @@ private[deploy] class Master(
   }
 
   // Alternative application submission gateway that is stable across Spark 
versions
-  private val restServerEnabled = conf.getBoolean("spark.master.rest.enabled", 
true)
+  private val restServerEnabled = conf.getBoolean("spark.master.rest.enabled", 
false)
   private var restServer: Option[StandaloneRestServer] = None
   private var restServerBoundPort: Option[Int] = None
 
+  {
+val authKey = SecurityManager.SPARK_AUTH_SECRET_CONF
+require(conf.getOption(authKey).isEmpty || !restServerEnabled,
+  s"The RestSubmissionServer does not support authentication via 
${authKey}.  Either turn " +
+"off the RestSubmissionServer with spark.master.re

spark git commit: [SPARK-25111][BUILD] increment kinesis client/producer & aws-sdk versions

2018-08-15 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 19c45db47 -> 4d8ae0d1c


[SPARK-25111][BUILD] increment kinesis client/producer & aws-sdk versions

This PR has been superseded by #22081.

## What changes were proposed in this pull request?

Increment the Kinesis client, producer, and transitive AWS SDK versions to a more recent release.

This is to help with the move off Bouncy Castle in #21146 and #22081; the goal is that moving up to the new SDK will allow a JVM with unlimited JCE but without Bouncy Castle to work with Kinesis endpoints.

Why this specific set of artifacts? It syncs up with the 1.11.271 AWS SDK used by Hadoop 3.0.3, hadoop-3.1, and Hadoop 3.1.1; that's been stable for the uses there (S3, STS, DynamoDB).

## How was this patch tested?

Running all the external/kinesis-asl tests via Maven with Java 8.121 and unlimited JCE, without Bouncy Castle (#21146); default endpoint of us-west-2. Without this SDK update I was getting HTTP certificate validation errors; with it, they went away.

# This PR is not ready without

* Jenkins test runs to see what it is happy with
* more testing: repeated runs, another endpoint
* looking at the new deprecation warnings and selectively addressing them (the 
AWS SDKs are pretty aggressive about deprecation, but sometimes they increase 
the complexity of the client code or block some codepaths off completely)

Closes #22099 from steveloughran/cloud/SPARK-25111-kinesis.

Authored-by: Steve Loughran 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4d8ae0d1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4d8ae0d1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4d8ae0d1

Branch: refs/heads/master
Commit: 4d8ae0d1c846560e1cac3480d73f8439968430a6
Parents: 19c45db
Author: Steve Loughran 
Authored: Wed Aug 15 12:06:11 2018 -0500
Committer: Sean Owen 
Committed: Wed Aug 15 12:06:11 2018 -0500

--
 pom.xml | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4d8ae0d1/pom.xml
--
diff --git a/pom.xml b/pom.xml
index 979d709..33c15f2 100644
--- a/pom.xml
+++ b/pom.xml
@@ -143,11 +143,11 @@
 1.8.2
 hadoop2
 0.9.4
-1.7.3
+1.8.10
 
-1.11.76
+1.11.271
 
-0.10.2
+0.12.8
 
 4.5.6
 4.4.10


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-23654][BUILD] remove jets3t as a dependency of spark

2018-08-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master ea63a7a16 -> b3e6fe7c4


[SPARK-23654][BUILD] remove jets3t as a dependency of spark

## What changes were proposed in this pull request?

Remove the jets3t dependency, and the Bouncy Castle dependency it brings in; update licenses and deps.
Note this just takes over https://github.com/apache/spark/pull/21146.

## How was this patch tested?

Existing tests.

Closes #22081 from srowen/SPARK-23654.

Authored-by: Sean Owen 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b3e6fe7c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b3e6fe7c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b3e6fe7c

Branch: refs/heads/master
Commit: b3e6fe7c46bad991e850d258887400db5f7d7736
Parents: ea63a7a
Author: Sean Owen 
Authored: Thu Aug 16 12:34:23 2018 -0700
Committer: Sean Owen 
Committed: Thu Aug 16 12:34:23 2018 -0700

--
 LICENSE-binary  |  2 --
 NOTICE  |  2 --
 NOTICE-binary   | 21 ---
 core/pom.xml|  4 +--
 dev/deps/spark-deps-hadoop-2.6  |  4 ---
 dev/deps/spark-deps-hadoop-2.7  |  4 ---
 dev/deps/spark-deps-hadoop-3.1  |  4 ---
 external/kafka-0-10-assembly/pom.xml|  5 
 external/kafka-0-8-assembly/pom.xml |  5 
 external/kinesis-asl-assembly/pom.xml   |  5 
 licenses-binary/LICENSE-bouncycastle-bcprov.txt |  7 -
 pom.xml | 28 
 12 files changed, 13 insertions(+), 78 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b3e6fe7c/LICENSE-binary
--
diff --git a/LICENSE-binary b/LICENSE-binary
index c033dd8..b94ea90 100644
--- a/LICENSE-binary
+++ b/LICENSE-binary
@@ -228,7 +228,6 @@ org.apache.xbean:xbean-asm5-shaded
 com.squareup.okhttp3:logging-interceptor
 com.squareup.okhttp3:okhttp
 com.squareup.okio:okio
-net.java.dev.jets3t:jets3t
 org.apache.spark:spark-catalyst_2.11
 org.apache.spark:spark-kvstore_2.11
 org.apache.spark:spark-launcher_2.11
@@ -447,7 +446,6 @@ org.slf4j:jul-to-slf4j
 org.slf4j:slf4j-api
 org.slf4j:slf4j-log4j12
 com.github.scopt:scopt_2.11
-org.bouncycastle:bcprov-jdk15on
 
 core/src/main/resources/org/apache/spark/ui/static/dagre-d3.min.js
 core/src/main/resources/org/apache/spark/ui/static/*dataTables*

http://git-wip-us.apache.org/repos/asf/spark/blob/b3e6fe7c/NOTICE
--
diff --git a/NOTICE b/NOTICE
index 23cb53f..fefe08b 100644
--- a/NOTICE
+++ b/NOTICE
@@ -26,5 +26,3 @@ The following provides more details on the included 
cryptographic software:
 This software uses Apache Commons Crypto 
(https://commons.apache.org/proper/commons-crypto/) to
 support authentication, and encryption and decryption of data sent across the 
network between
 services.
-
-This software includes Bouncy Castle (http://bouncycastle.org/) to support the 
jets3t library.

http://git-wip-us.apache.org/repos/asf/spark/blob/b3e6fe7c/NOTICE-binary
--
diff --git a/NOTICE-binary b/NOTICE-binary
index ad256aa..b707c43 100644
--- a/NOTICE-binary
+++ b/NOTICE-binary
@@ -27,8 +27,6 @@ This software uses Apache Commons Crypto 
(https://commons.apache.org/proper/comm
 support authentication, and encryption and decryption of data sent across the 
network between
 services.
 
-This software includes Bouncy Castle (http://bouncycastle.org/) to support the 
jets3t library.
-
 
 // --
 // NOTICE file corresponding to the section 4d of The Apache License,
@@ -1162,25 +1160,6 @@ NonlinearMinimizer class in package 
breeze.optimize.proximal is distributed with
 2015, Debasish Das (Verizon), all rights reserved.
 
 
-   =
-   ==  NOTICE file corresponding to section 4(d) of the Apache License,   ==
-   ==  Version 2.0, in this case for the distribution of jets3t.  ==
-   =
-
-   This product includes software developed by:
-
-   The Apache Software Foundation (http://www.apache.org/).
-
-   The ExoLab Project (http://www.exolab.org/)
-
-   Sun Microsystems (http://www.sun.com/)
-
-   Codehaus (http://castor.codehaus.org)
-
-   Tatu Saloranta (http://wiki.fasterxml.com/TatuSaloranta)
-
-
-
 stream-lib
 Copyright 2016 AddThis
 

http://git-wip-us.apache.org/repos/asf/spark/blob/b3e6fe7c/core/pom.

spark git commit: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/NB

2018-08-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master b3e6fe7c4 -> e50192494


[SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/NB

## What changes were proposed in this pull request?
logNumExamples in KMeans/BiKM/GMM/AFT/NB

## How was this patch tested?
Existing tests.

Closes #21561 from zhengruifeng/alg_logNumExamples.

Authored-by: zhengruifeng 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e5019249
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e5019249
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e5019249

Branch: refs/heads/master
Commit: e50192494d1ae1bdaf845ddd388189998c1a2403
Parents: b3e6fe7
Author: zhengruifeng 
Authored: Thu Aug 16 15:23:32 2018 -0700
Committer: Sean Owen 
Committed: Thu Aug 16 15:23:32 2018 -0700

--
 .../spark/ml/classification/LinearSVC.scala |  2 +-
 .../ml/classification/LogisticRegression.scala  |  2 +-
 .../spark/ml/classification/NaiveBayes.scala| 14 +++-
 .../spark/ml/clustering/BisectingKMeans.scala   |  3 ++-
 .../spark/ml/clustering/GaussianMixture.scala   |  5 +
 .../ml/regression/AFTSurvivalRegression.scala   |  1 +
 .../mllib/clustering/BisectingKMeans.scala  | 23 ++--
 .../apache/spark/mllib/clustering/KMeans.scala  | 10 +++--
 8 files changed, 42 insertions(+), 18 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e5019249/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
index 20f9366..1b5c02f 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
@@ -189,7 +189,7 @@ class LinearSVC @Since("2.2.0") (
 (new MultivariateOnlineSummarizer, new MultiClassSummarizer)
   )(seqOp, combOp, $(aggregationDepth))
 }
-instr.logNamedValue(Instrumentation.loggerTags.numExamples, 
summarizer.count)
+instr.logNumExamples(summarizer.count)
 instr.logNamedValue("lowestLabelWeight", 
labelSummarizer.histogram.min.toString)
 instr.logNamedValue("highestLabelWeight", 
labelSummarizer.histogram.max.toString)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/e5019249/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index 408d92e..6f0804f 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -519,7 +519,7 @@ class LogisticRegression @Since("1.2.0") (
 (new MultivariateOnlineSummarizer, new MultiClassSummarizer)
   )(seqOp, combOp, $(aggregationDepth))
 }
-instr.logNamedValue(Instrumentation.loggerTags.numExamples, 
summarizer.count)
+instr.logNumExamples(summarizer.count)
 instr.logNamedValue("lowestLabelWeight", 
labelSummarizer.histogram.min.toString)
 instr.logNamedValue("highestLabelWeight", 
labelSummarizer.histogram.max.toString)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/e5019249/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
index f65d397..51495c1 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
@@ -162,19 +162,21 @@ class NaiveBayes @Since("1.5.0") (
 // TODO: similar to reduceByKeyLocally to save one stage.
 val aggregated = dataset.select(col($(labelCol)), w, 
col($(featuresCol))).rdd
   .map { row => (row.getDouble(0), (row.getDouble(1), 
row.getAs[Vector](2)))
-  }.aggregateByKey[(Double, DenseVector)]((0.0, 
Vectors.zeros(numFeatures).toDense))(
+  }.aggregateByKey[(Double, DenseVector, Long)]((0.0, 
Vectors.zeros(numFeatures).toDense, 0L))(
   seqOp = {
- case ((weightSum: Double, featureSum: DenseVector), (weight, 
features)) =>
+ case ((weightSum, featureSum, count), (weight, features)) =>
requireValues(features)
BLAS.axpy(weight, features, featureSum)
-   (weightSum + 

spark-website git commit: For projects using names that are likely proscribed, either update to use current conforming name or remove the link, for now

2018-08-16 Thread srowen
Repository: spark-website
Updated Branches:
  refs/heads/asf-site 8eb764260 -> fff22b750


For projects using names that are likely proscribed, either update to use the current conforming name or remove the link, for now.

Author: Sean Owen 

Closes #137 from srowen/SparkNames.


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/fff22b75
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/fff22b75
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/fff22b75

Branch: refs/heads/asf-site
Commit: fff22b750f5e84fc5bc7fb8e795a1ac72ccefd99
Parents: 8eb7642
Author: Sean Owen 
Authored: Thu Aug 16 15:25:57 2018 -0700
Committer: Sean Owen 
Committed: Thu Aug 16 15:25:57 2018 -0700

--
 site/third-party-projects.html | 28 
 third-party-projects.md| 28 
 2 files changed, 24 insertions(+), 32 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/fff22b75/site/third-party-projects.html
--
diff --git a/site/third-party-projects.html b/site/third-party-projects.html
index c3b8dd1..36259b7 100644
--- a/site/third-party-projects.html
+++ b/site/third-party-projects.html
@@ -202,6 +202,14 @@
   
 This page tracks external software projects that supplement Apache 
Spark and add to its ecosystem.
 
+To add a project, open a pull request against the https://github.com/apache/spark-website";>spark-website 
+repository. Add an entry to 
+https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md";>this
 markdown file, 
+then run jekyll build to generate the HTML too. Include
+both in your pull request. See the README in this repo for more 
information.
+
+Note that all project and product names should follow trademark guidelines.
+
 spark-packages.org
 
 https://spark-packages.org/";>spark-packages.org is an 
external, 
@@ -211,7 +219,7 @@ Apache Spark. You can add a package as long as you have a 
GitHub repository.
 Infrastructure Projects
 
 
-  https://github.com/spark-jobserver/spark-jobserver";>Spark Job 
Server - 
+  https://github.com/spark-jobserver/spark-jobserver";>REST Job 
Server for Apache Spark - 
 REST interface for managing and submitting Spark jobs on the same cluster 
 (see http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server";>blog
 post 
 for details)
@@ -220,25 +228,16 @@ for details)
 running Spark
   http://alluxio.org/";>Alluxio (née Tachyon) - Memory speed 
virtual distributed 
 storage system that supports running Spark
-  https://github.com/datastax/spark-cassandra-connector";>Spark 
Cassandra Connector - 
-Easily load your Cassandra data into Spark and Spark SQL; from Datastax
   https://github.com/filodb/FiloDB";>FiloDB - a Spark 
integrated analytical/columnar 
 database, with in-memory option capable of sub-second concurrent queries
   http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html#spark-sql";>ElasticSearch
 - 
 Spark SQL Integration
-  https://github.com/tresata/spark-scalding";>Spark-Scalding - 
Easily transition 
-Cascading/Scalding code to Spark
-  http://zeppelin-project.org/";>Zeppelin - an IPython-like 
notebook for Spark. There 
-is also https://github.com/tribbloid/ISpark";>ISpark, and the 
-https://github.com/andypetrella/spark-notebook/";>Spark 
Notebook.
-  http://www.ibm.com/developerworks/servicemanagement/tc/pcs/index.html";>IBM
 Spectrum Conductor with Spark - 
-cluster management software that integrates with Spark
+  http://zeppelin-project.org/";>Zeppelin - Multi-purpose 
notebook which supports 20+ language backends,
+including Apache Spark
   https://github.com/EclairJS/eclairjs-node";>EclairJS - 
enables Node.js developers to code
 against Spark, and data scientists to use Javascript in Jupyter notebooks.
   https://github.com/SnappyDataInc/snappydata";>SnappyData - 
an open source 
 OLTP + OLAP database integrated with Spark on the same JVMs.
-  https://github.com/DataSystemsLab/GeoSpark";>GeoSpark - 
Geospatial RDDs and joins
-  https://github.com/ispras/spark-openstack";>Spark Cluster Deploy 
Tools for OpenStack
   https://github.com/Hydrospheredata/mist";>Mist - Serverless 
proxy for Spark cluster (spark middleware)
 
 
@@ -253,8 +252,6 @@ system for large-scale, distributed data analysis, built on 
top of Apache Hadoop
 on top of Shark and Spark
   https://github.com/adobe-research/spindle";>Spindle - 
Spark/Parquet-based web 
 analytics query engine
-  http://simin.me/projects/spatialspark/";>Spark Spatial - 
Spatial joins an

spark-website git commit: Mention correctness issues as blockers in developer docs; mention 18 month release branch maintenance guideline

2018-08-16 Thread srowen
Repository: spark-website
Updated Branches:
  refs/heads/asf-site fff22b750 -> 0bef76657


Mention correctness issues as blockers in developer docs; mention the 18-month release branch maintenance guideline.

Author: Sean Owen 

Closes #136 from srowen/Correctness.


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/0bef7665
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/0bef7665
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/0bef7665

Branch: refs/heads/asf-site
Commit: 0bef766573fac8e6451c49c271f8edcb8f9c7aff
Parents: fff22b7
Author: Sean Owen 
Authored: Thu Aug 16 16:00:57 2018 -0700
Committer: Sean Owen 
Committed: Thu Aug 16 16:00:57 2018 -0700

--
 contributing.md | 16 ++--
 site/contributing.html  | 19 +--
 site/versioning-policy.html | 11 ---
 versioning-policy.md| 11 ---
 4 files changed, 47 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/0bef7665/contributing.md
--
diff --git a/contributing.md b/contributing.md
index fd9fec0..9b964a8 100644
--- a/contributing.md
+++ b/contributing.md
@@ -244,10 +244,11 @@ Example: `Fix typos in Foo scaladoc`
 1. Set required fields:
 1. **Issue Type**. Generally, Bug, Improvement and New Feature are the 
only types used in Spark.
 1. **Priority**. Set to Major or below; higher priorities are 
generally reserved for 
-committers to set. JIRA tends to unfortunately conflate "size" and 
"importance" in its 
+committers to set. The main exception is correctness or data-loss 
issues, which can be flagged as
+Blockers. JIRA tends to unfortunately conflate "size" and "importance" 
in its 
 Priority field values. Their meaning is roughly:
  1. Blocker: pointless to release without this change as the 
release would be unusable 
- to a large minority of users
+ to a large minority of users. Correctness and data loss issues 
should be considered Blockers.
  1. Critical: a large minority of users are missing important 
functionality without 
  this, and/or a workaround is difficult
  1. Major: a small minority of users are missing important 
functionality without this, 
@@ -258,6 +259,17 @@ Example: `Fix typos in Foo scaladoc`
 1. **Component**
 1. **Affects Version**. For Bugs, assign at least one version that is 
known to exhibit the 
 problem or need the change
+1. **Label**. Not widely used, except for the following:
+ - `correctness`: a correctness issue
+ - `data-loss`: a data loss issue
+ - `release-notes`: the change's effects need mention in release 
notes. The JIRA or pull request
+ should include detail suitable for inclusion in release notes -- 
see "Docs Text" below.
+ - `starter`: small, simple change suitable for new contributors
+1. **Docs Text**: For issues that require an entry in the release 
notes, this should contain the
+information that the release manager should include in Release Notes. 
This should include a short summary
+of what behavior is impacted, and detail on what behavior changed. It 
can be provisionally filled out
+when the JIRA is opened, but will likely need to be updated with final 
details when the issue is
+resolved.
 1. Do not set the following fields:
 1. **Fix Version**. This is assigned by committers only when resolved.
 1. **Target Version**. This is assigned by committers to indicate a PR 
has been accepted for 

http://git-wip-us.apache.org/repos/asf/spark-website/blob/0bef7665/site/contributing.html
--
diff --git a/site/contributing.html b/site/contributing.html
index ce8580a..9fc45e5 100644
--- a/site/contributing.html
+++ b/site/contributing.html
@@ -463,11 +463,12 @@ Example: Fix typos in Foo scaladoc
 
   Issue Type. Generally, Bug, Improvement and New 
Feature are the only types used in Spark.
   Priority. Set to Major or below; higher 
priorities are generally reserved for 
- committers to set. JIRA tends to unfortunately conflate “size” 
and “importance” in its 
+ committers to set. The main exception is correctness or data-loss issues, 
which can be flagged as
+ Blockers. JIRA tends to unfortunately conflate “size” and 
“importance” in its 
  Priority field values. Their meaning is roughly:
 
   Blocker: pointless to release without this change as

spark git commit: [DOCS] Update configuration.md

2018-08-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master e59dd8fa0 -> 709f541dd


[DOCS] Update configuration.md

Changed $SPARK_HOME/conf/spark-default.conf to $SPARK_HOME/conf/spark-defaults.conf.

No testing necessary, as this was a documentation-only change.

Closes #22116 from KraFusion/patch-1.

Authored-by: Joey Krabacher 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/709f541d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/709f541d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/709f541d

Branch: refs/heads/master
Commit: 709f541dd0c41c2ae8c0871b2593be9100bfc4ee
Parents: e59dd8f
Author: Joey Krabacher 
Authored: Thu Aug 16 16:47:52 2018 -0700
Committer: Sean Owen 
Committed: Thu Aug 16 16:47:52 2018 -0700

--
 docs/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/709f541d/docs/configuration.md
--
diff --git a/docs/configuration.md b/docs/configuration.md
index 9c4742a..0270dc2 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -2213,7 +2213,7 @@ Spark's classpath for each application. In a Spark 
cluster running on YARN, thes
 files are set cluster-wide, and cannot safely be changed by the application.
 
 The better choice is to use spark hadoop properties in the form of 
`spark.hadoop.*`. 
-They can be considered as same as normal spark properties which can be set in 
`$SPARK_HOME/conf/spark-default.conf`
+They can be considered as same as normal spark properties which can be set in 
`$SPARK_HOME/conf/spark-defaults.conf`
 
 In some cases, you may want to avoid hard-coding certain configurations in a 
`SparkConf`. For
 instance, Spark allows you to simply create an empty conf and set spark/spark 
hadoop properties.
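
As an aside (not part of the patch), a minimal sketch of the two equivalent approaches described above; the property `dfs.replication` and its value are illustrative assumptions only:

    // Entry in $SPARK_HOME/conf/spark-defaults.conf (illustrative):
    //   spark.hadoop.dfs.replication   2
    //
    // The same Hadoop property set programmatically on an empty SparkConf:
    import org.apache.spark.SparkConf

    object SparkHadoopPropsSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        // everything under the spark.hadoop.* prefix is forwarded to the Hadoop Configuration
        conf.set("spark.hadoop.dfs.replication", "2")
        println(conf.get("spark.hadoop.dfs.replication"))
      }
    }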


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [DOCS] Fix cloud-integration.md Typo

2018-08-16 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 709f541dd -> 30be71e91


[DOCS] Fix cloud-integration.md Typo

Corrected typo; changed spark-default.conf to spark-defaults.conf

Closes #22125 from KraFusion/patch-2.

Authored-by: Joey Krabacher 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/30be71e9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/30be71e9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/30be71e9

Branch: refs/heads/master
Commit: 30be71e91251971ad45c018538395cbebebc0c83
Parents: 709f541
Author: Joey Krabacher 
Authored: Thu Aug 16 16:48:51 2018 -0700
Committer: Sean Owen 
Committed: Thu Aug 16 16:48:51 2018 -0700

--
 docs/cloud-integration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/30be71e9/docs/cloud-integration.md
--
diff --git a/docs/cloud-integration.md b/docs/cloud-integration.md
index 18e8fe7..36753f6 100644
--- a/docs/cloud-integration.md
+++ b/docs/cloud-integration.md
@@ -104,7 +104,7 @@ Spark jobs must authenticate with the object stores to 
access data within them.
 and `AWS_SESSION_TOKEN` environment variables and sets the associated 
authentication options
 for the `s3n` and `s3a` connectors to Amazon S3.
 1. In a Hadoop cluster, settings may be set in the `core-site.xml` file.
-1. Authentication details may be manually added to the Spark configuration in 
`spark-default.conf`
+1. Authentication details may be manually added to the Spark configuration in 
`spark-defaults.conf`
 1. Alternatively, they can be programmatically set in the `SparkConf` instance 
used to configure 
 the application's `SparkContext`.
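
As an illustration of the last two options (not part of the patch), a minimal Scala sketch that supplies S3A credentials through the `spark.hadoop.*` prefix; the key names are the standard Hadoop S3A options, and the credential values here are placeholders read from the environment:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    object S3AAuthSketch {
      def main(args: Array[String]): Unit = {
        // Equivalent entries could instead go in spark-defaults.conf:
        //   spark.hadoop.fs.s3a.access.key   <ACCESS_KEY_ID>
        //   spark.hadoop.fs.s3a.secret.key   <SECRET_ACCESS_KEY>
        val conf = new SparkConf()
          .set("spark.hadoop.fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
          .set("spark.hadoop.fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
        val spark = SparkSession.builder()
          .appName("s3a-auth-sketch")
          .master("local[*]")   // local master only so the sketch is self-contained
          .config(conf)
          .getOrCreate()
        spark.stop()
      }
    }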
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [DOCS] Fixed NDCG formula issues

2018-08-20 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 60af2501e -> 219ed7b48


[DOCS] Fixed NDCG formula issues

When j is 0, log(j+1) will be 0, which leads to a division-by-zero issue.

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, 
manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a pull 
request.

Closes #22090 from yueguoguo/patch-1.

Authored-by: Zhang Le 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/219ed7b4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/219ed7b4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/219ed7b4

Branch: refs/heads/master
Commit: 219ed7b487c2dfb5007247f77ebf1b3cc73cecb5
Parents: 60af250
Author: Zhang Le 
Authored: Mon Aug 20 14:59:03 2018 -0500
Committer: Sean Owen 
Committed: Mon Aug 20 14:59:03 2018 -0500

--
 docs/mllib-evaluation-metrics.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/219ed7b4/docs/mllib-evaluation-metrics.md
--
diff --git a/docs/mllib-evaluation-metrics.md b/docs/mllib-evaluation-metrics.md
index d9dbbab..c65ecdc 100644
--- a/docs/mllib-evaluation-metrics.md
+++ b/docs/mllib-evaluation-metrics.md
@@ -462,13 +462,13 @@ $$rel_D(r) = \begin{cases}1 & \text{if $r \in D$}, \\ 0 & 
\text{otherwise}.\end{
   Normalized Discounted Cumulative Gain
   
 $NDCG(k)=\frac{1}{M} \sum_{i=0}^{M-1} {\frac{1}{IDCG(D_i, 
k)}\sum_{j=0}^{n-1}
-  \frac{rel_{D_i}(R_i(j))}{\text{ln}(j+1)}} \\
+  \frac{rel_{D_i}(R_i(j))}{\text{ln}(j+2)}} \\
 \text{Where} \\
 \hspace{5 mm} n = 
\text{min}\left(\text{max}\left(|R_i|,|D_i|\right),k\right) \\
-\hspace{5 mm} IDCG(D, k) = \sum_{j=0}^{\text{min}(\left|D\right|, k) - 
1} \frac{1}{\text{ln}(j+1)}$
+\hspace{5 mm} IDCG(D, k) = \sum_{j=0}^{\text{min}(\left|D\right|, k) - 
1} \frac{1}{\text{ln}(j+2)}$
   
   
-https://en.wikipedia.org/wiki/Information_retrieval#Discounted_cumulative_gain";>NDCG
 at k is a
+https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG";>NDCG
 at k is a
 measure of how many of the first k recommended documents are in the 
set of true relevant documents averaged
 across all users. In contrast to precision at k, this metric takes 
into account the order of the recommendations
 (documents are assumed to be in order of decreasing relevance).
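
As a quick illustration (not part of the patch), consider only the first term of the inner sum: the top-ranked document has index j = 0, so with the old denominator

    \frac{rel_{D_i}(R_i(0))}{\text{ln}(0+1)} = \frac{rel_{D_i}(R_i(0))}{0}

is undefined, while the corrected form gives

    \frac{rel_{D_i}(R_i(0))}{\text{ln}(0+2)} = \frac{rel_{D_i}(R_i(0))}{\text{ln}(2)}

With zero-based j, ln(j+2) matches the usual DCG discount of log(rank+1) for one-based ranks; and since DCG and IDCG change by the same factor, the base of the logarithm does not affect NDCG.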


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [DOCS] Fixed NDCG formula issues

2018-08-20 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 ea01e362f -> 9702bb637


[DOCS] Fixed NDCG formula issues

When j is 0, log(j+1) will be 0, which leads to a division-by-zero issue.

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, 
manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a pull 
request.

Closes #22090 from yueguoguo/patch-1.

Authored-by: Zhang Le 
Signed-off-by: Sean Owen 
(cherry picked from commit 219ed7b487c2dfb5007247f77ebf1b3cc73cecb5)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9702bb63
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9702bb63
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9702bb63

Branch: refs/heads/branch-2.3
Commit: 9702bb637d5ac665fefaa96cc69c5f92553f613a
Parents: ea01e36
Author: Zhang Le 
Authored: Mon Aug 20 14:59:03 2018 -0500
Committer: Sean Owen 
Committed: Mon Aug 20 14:59:21 2018 -0500

--
 docs/mllib-evaluation-metrics.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9702bb63/docs/mllib-evaluation-metrics.md
--
diff --git a/docs/mllib-evaluation-metrics.md b/docs/mllib-evaluation-metrics.md
index 7f27754..ac398fb 100644
--- a/docs/mllib-evaluation-metrics.md
+++ b/docs/mllib-evaluation-metrics.md
@@ -462,13 +462,13 @@ $$rel_D(r) = \begin{cases}1 & \text{if $r \in D$}, \\ 0 & 
\text{otherwise}.\end{
   Normalized Discounted Cumulative Gain
   
 $NDCG(k)=\frac{1}{M} \sum_{i=0}^{M-1} {\frac{1}{IDCG(D_i, 
k)}\sum_{j=0}^{n-1}
-  \frac{rel_{D_i}(R_i(j))}{\text{ln}(j+1)}} \\
+  \frac{rel_{D_i}(R_i(j))}{\text{ln}(j+2)}} \\
 \text{Where} \\
 \hspace{5 mm} n = 
\text{min}\left(\text{max}\left(|R_i|,|D_i|\right),k\right) \\
-\hspace{5 mm} IDCG(D, k) = \sum_{j=0}^{\text{min}(\left|D\right|, k) - 
1} \frac{1}{\text{ln}(j+1)}$
+\hspace{5 mm} IDCG(D, k) = \sum_{j=0}^{\text{min}(\left|D\right|, k) - 
1} \frac{1}{\text{ln}(j+2)}$
   
   
-https://en.wikipedia.org/wiki/Information_retrieval#Discounted_cumulative_gain";>NDCG
 at k is a
+https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG";>NDCG
 at k is a
 measure of how many of the first k recommended documents are in the 
set of true relevant documents averaged
 across all users. In contrast to precision at k, this metric takes 
into account the order of the recommendations
 (documents are assumed to be in order of decreasing relevance).


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [BUILD] Close stale PRs

2018-08-21 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 4fb96e510 -> b8788b3e7


[BUILD] Close stale PRs

Closes #16411
Closes #21870
Closes #21794
Closes #21610
Closes #21961
Closes #21940
Closes #21870
Closes #22118
Closes #21624
Closes #19528
Closes #18424

Closes #22159 from srowen/Stale.

Authored-by: Sean Owen 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b8788b3e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b8788b3e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b8788b3e

Branch: refs/heads/master
Commit: b8788b3e79d0d508e3a910fefd7e9cff4c6d6245
Parents: 4fb96e5
Author: Sean Owen 
Authored: Tue Aug 21 08:18:21 2018 -0500
Committer: Sean Owen 
Committed: Tue Aug 21 08:18:21 2018 -0500

--

--



-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-25073][YARN] AM and Executor Memory validation message is not proper while submitting spark yarn application

2018-08-24 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master ab3302895 -> c20916a5d


[SPARK-25073][YARN] AM and Executor Memory validation message is not proper 
while submitting spark yarn application

**## What changes were proposed in this pull request?**
When the memory available under yarn.nodemanager.resource.memory-mb or 
yarn.scheduler.maximum-allocation-mb is insufficient, Spark always asks the user to increase 
yarn.scheduler.maximum-allocation-mb, even though the value shown in the message may actually 
come from yarn.nodemanager.resource.memory-mb. Because this is misleading, the error message is 
changed to reference both properties, mirroring the executor memory validation message.

Definition of **yarn.nodemanager.resource.memory-mb:**
Amount of physical memory, in MB, that can be allocated for containers. This is the amount of 
memory YARN can use on the node, so it should be lower than the total memory of that machine.
**yarn.scheduler.maximum-allocation-mb:**
The maximum memory allocation available for a single container, in MB. The ResourceManager 
allocates memory to containers in increments of "yarn.scheduler.minimum-allocation-mb", never 
exceeding "yarn.scheduler.maximum-allocation-mb", which in turn should not be more than the 
total memory allocated to the node.
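
For illustration only, a standalone sketch of the kind of check this patch touches; the memory values (8192, 7936, 768 MB) are made-up stand-ins for what Spark derives from the cluster and the submitted application:

    object AmMemoryCheckSketch {
      def main(args: Array[String]): Unit = {
        val maxMem = 8192            // max container allocation the ResourceManager reports
        val amMemory = 7936          // requested AM memory in MB
        val amMemoryOverhead = 768   // overhead added on top of the AM memory
        val amMem = amMemory + amMemoryOverhead
        if (amMem > maxMem) {
          throw new IllegalArgumentException(
            s"Required AM memory ($amMemory+$amMemoryOverhead MB) is above the max threshold " +
            s"($maxMem MB) of this cluster! Please check the values of " +
            "'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.")
        }
      }
    }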

**## How was this patch tested?**
Manually tested in an HDFS/YARN cluster

Closes #22199 from sujith71955/maste_am_log.

Authored-by: s71955 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c20916a5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c20916a5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c20916a5

Branch: refs/heads/master
Commit: c20916a5dc4a7e771463838e797cb944569f6259
Parents: ab33028
Author: s71955 
Authored: Fri Aug 24 08:58:19 2018 -0500
Committer: Sean Owen 
Committed: Fri Aug 24 08:58:19 2018 -0500

--
 .../yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c20916a5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
--
diff --git 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
index 75614a4..698fc2c 100644
--- 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
+++ 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -344,7 +344,8 @@ private[spark] class Client(
 if (amMem > maxMem) {
   throw new IllegalArgumentException(s"Required AM memory ($amMemory" +
 s"+$amMemoryOverhead MB) is above the max threshold ($maxMem MB) of 
this cluster! " +
-"Please increase the value of 'yarn.scheduler.maximum-allocation-mb'.")
+"Please check the values of 'yarn.scheduler.maximum-allocation-mb' 
and/or " +
+"'yarn.nodemanager.resource.memory-mb'.")
 }
 logInfo("Will allocate AM container, with %d MB memory including %d MB 
overhead".format(
   amMem,


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org


