spark-website git commit: Add Israel Spark meetup to community page per request. Use https for meetup while we're here. Pick up a recent change to paper hyperlink reflected only in markdown, not HTML
Repository: spark-website Updated Branches: refs/heads/asf-site eee58685c -> 7c96b646e Add Israel Spark meetup to community page per request. Use https for meetup while we're here. Pick up a recent change to paper hyperlink reflected only in markdown, not HTML Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/7c96b646 Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/7c96b646 Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/7c96b646 Branch: refs/heads/asf-site Commit: 7c96b646eb2de2dbe6aec91a82d86699e13c59c5 Parents: eee5868 Author: Sean Owen Authored: Wed Sep 21 08:32:16 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 08:32:16 2016 +0100 -- community.md| 57 +--- site/community.html | 57 +--- site/research.html | 2 +- 3 files changed, 61 insertions(+), 55 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark-website/blob/7c96b646/community.md -- diff --git a/community.md b/community.md index d856409..b0c5b3a 100644 --- a/community.md +++ b/community.md @@ -56,84 +56,87 @@ navigation: Spark Meetups are grass-roots events organized and hosted by leaders and champions in the community around the world. Check out http://spark.meetup.com";>http://spark.meetup.com to find a Spark meetup in your part of the world. Below is a partial list of Spark meetups. -http://www.meetup.com/spark-users/";>Bay Area Spark Meetup. +https://www.meetup.com/spark-users/";>Bay Area Spark Meetup. This group has been running since January 2012 in the San Francisco area. -The meetup page also contains an http://www.meetup.com/spark-users/events/past/";>archive of past meetups, including videos and http://www.meetup.com/spark-users/files/";>slides for most of the recent talks. +The meetup page also contains an https://www.meetup.com/spark-users/events/past/";>archive of past meetups, including videos and https://www.meetup.com/spark-users/files/";>slides for most of the recent talks. 
-http://www.meetup.com/Spark-Barcelona/";>Barcelona Spark Meetup +https://www.meetup.com/Spark-Barcelona/";>Barcelona Spark Meetup -http://www.meetup.com/Spark_big_data_analytics/";>Bangalore Spark Meetup +https://www.meetup.com/Spark_big_data_analytics/";>Bangalore Spark Meetup -http://www.meetup.com/Berlin-Apache-Spark-Meetup/";>Berlin Spark Meetup +https://www.meetup.com/Berlin-Apache-Spark-Meetup/";>Berlin Spark Meetup -http://www.meetup.com/spark-user-beijing-Meetup/";>Beijing Spark Meetup +https://www.meetup.com/spark-user-beijing-Meetup/";>Beijing Spark Meetup -http://www.meetup.com/Boston-Apache-Spark-User-Group/";>Boston Spark Meetup +https://www.meetup.com/Boston-Apache-Spark-User-Group/";>Boston Spark Meetup -http://www.meetup.com/Boulder-Denver-Spark-Meetup/";>Boulder/Denver Spark Meetup +https://www.meetup.com/Boulder-Denver-Spark-Meetup/";>Boulder/Denver Spark Meetup -http://www.meetup.com/Chicago-Spark-Users/";>Chicago Spark Users +https://www.meetup.com/Chicago-Spark-Users/";>Chicago Spark Users -http://www.meetup.com/Christchurch-Apache-Spark-Meetup/";>Christchurch Apache Spark Meetup +https://www.meetup.com/Christchurch-Apache-Spark-Meetup/";>Christchurch Apache Spark Meetup -http://www.meetup.com/Cincinnati-Apache-Spark-Meetup/";>Cincinanati Apache Spark Meetup +https://www.meetup.com/Cincinnati-Apache-Spark-Meetup/";>Cincinanati Apache Spark Meetup -http://www.meetup.com/Hangzhou-Apache-Spark-Meetup/";>Hangzhou Spark Meetup +https://www.meetup.com/Hangzhou-Apache-Spark-Meetup/";>Hangzhou Spark Meetup -http://www.meetup.com/Spark-User-Group-Hyderabad/";>Hyderabad Spark Meetup +https://www.meetup.com/Spark-User-Group-Hyderabad/";>Hyderabad Spark Meetup -http://www.meetup.com/Apache-Spark-Ljubljana-Meetup/";>Ljubljana Spark Meetup +https://www.meetup.com/israel-spark-users/";>Israel Spark Users -http://www.meetup.com/Spark-London/";>London Spark Meetup +https://www.meetup.com/Apache-Spark-Ljubljana-Meetup/";>Ljubljana Spark Meetup -http://www.meetup.com/Apache-Spark-Maryland/";>Maryland Spark Meetup +https://www.meetup.com/Spark-London/";>London Spark Meetup -http://www.meetup.com/Mumbai-Spark-Meetup/";>Mumbai Spark Meetup +https://www.meetup.com/Apache-Spark-Maryland/";>Maryland Spark Meetup -http://www.meetup.com/Apache-Spark-in-Moscow/";>Moscow Spark Meetup +https://www.meetup.com/Mumbai-
spark git commit: [CORE][DOC] Fix errors in comments
Repository: spark Updated Branches: refs/heads/master e48ebc4e4 -> 61876a427 [CORE][DOC] Fix errors in comments ## What changes were proposed in this pull request? While reading source code of CORE and SQL core, I found some minor errors in comments such as extra space, missing blank line and grammar error. I fixed these minor errors and might find more during my source code study. ## How was this patch tested? Manually build Author: wm...@hotmail.com Closes #15151 from wangmiao1981/mem. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/61876a42 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/61876a42 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/61876a42 Branch: refs/heads/master Commit: 61876a42793bde0da90f54b44255148ed54b7f61 Parents: e48ebc4 Author: wm...@hotmail.com Authored: Wed Sep 21 09:33:29 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 09:33:29 2016 +0100 -- core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala | 2 +- sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/61876a42/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala -- diff --git a/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala b/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala index cae7c9e..f255f5b 100644 --- a/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala +++ b/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala @@ -28,7 +28,7 @@ import org.apache.spark.util.Utils * :: DeveloperApi :: * This class represent an unique identifier for a BlockManager. * - * The first 2 constructors of this class is made private to ensure that BlockManagerId objects + * The first 2 constructors of this class are made private to ensure that BlockManagerId objects * can be created only using the apply method in the companion object. This allows de-duplication * of ID objects. Also, constructor parameters are private to ensure that parameters cannot be * modified from outside this class. http://git-wip-us.apache.org/repos/asf/spark/blob/61876a42/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala index 0f6292d..6d7ac0f 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala @@ -937,7 +937,7 @@ object SparkSession { } /** - * Return true if Hive classes can be loaded, otherwise false. + * @return true if Hive classes can be loaded, otherwise false. */ private[spark] def hiveClassesArePresent: Boolean = { try { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
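For context, `hiveClassesArePresent` (whose scaladoc is corrected above) is a classpath probe. A minimal standalone sketch of that kind of check, using plain `Class.forName` rather than Spark's internal `Utils.classForName`, and not the Spark code itself:

```
// Sketch: report whether a class can be loaded from the current classpath,
// treating both a missing class and a missing transitive dependency as "absent".
def classIsPresent(className: String): Boolean =
  try {
    Class.forName(className)
    true
  } catch {
    case _: ClassNotFoundException | _: NoClassDefFoundError => false
  }

// e.g. classIsPresent("org.apache.hadoop.hive.conf.HiveConf")
```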
spark git commit: [SPARK-17585][PYSPARK][CORE] PySpark SparkContext.addFile supports adding files recursively
Repository: spark Updated Branches: refs/heads/master 61876a427 -> d3b886976 [SPARK-17585][PYSPARK][CORE] PySpark SparkContext.addFile supports adding files recursively ## What changes were proposed in this pull request? Users would like to add a directory as dependency in some cases, they can use ```SparkContext.addFile``` with argument ```recursive=true``` to recursively add all files under the directory by using Scala. But Python users can only add file not directory, we should also make it supported. ## How was this patch tested? Unit test. Author: Yanbo Liang Closes #15140 from yanboliang/spark-17585. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d3b88697 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d3b88697 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d3b88697 Branch: refs/heads/master Commit: d3b88697638dcf32854fe21a6c53dfb3782773b9 Parents: 61876a4 Author: Yanbo Liang Authored: Wed Sep 21 01:37:03 2016 -0700 Committer: Yanbo Liang Committed: Wed Sep 21 01:37:03 2016 -0700 -- .../spark/api/java/JavaSparkContext.scala | 13 + python/pyspark/context.py | 7 +-- python/pyspark/tests.py | 20 +++- python/test_support/hello.txt | 1 - python/test_support/hello/hello.txt | 1 + .../test_support/hello/sub_hello/sub_hello.txt | 1 + 6 files changed, 35 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d3b88697/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala index 131f36f..4e50c26 100644 --- a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala +++ b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala @@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext) } /** + * Add a file to be downloaded with this Spark job on every node. + * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported + * filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, + * use `SparkFiles.get(fileName)` to find its download location. + * + * A directory can be given if the recursive option is set to true. Currently directories are only + * supported for Hadoop-supported filesystems. + */ + def addFile(path: String, recursive: Boolean): Unit = { +sc.addFile(path, recursive) + } + + /** * Adds a JAR dependency for all tasks to be executed on this SparkContext in the future. * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported * filesystems), or an HTTP, HTTPS or FTP URI. http://git-wip-us.apache.org/repos/asf/spark/blob/d3b88697/python/pyspark/context.py -- diff --git a/python/pyspark/context.py b/python/pyspark/context.py index 5c32f8e..7a7f59c 100644 --- a/python/pyspark/context.py +++ b/python/pyspark/context.py @@ -767,7 +767,7 @@ class SparkContext(object): SparkContext._next_accum_id += 1 return Accumulator(SparkContext._next_accum_id - 1, value, accum_param) -def addFile(self, path): +def addFile(self, path, recursive=False): """ Add a file to be downloaded with this Spark job on every node. The C{path} passed can be either a local file, a file in HDFS @@ -778,6 +778,9 @@ class SparkContext(object): L{SparkFiles.get(fileName)} with the filename to find its download location. +A directory can be given if the recursive option is set to True. 
+Currently directories are only supported for Hadoop-supported filesystems. + >>> from pyspark import SparkFiles >>> path = os.path.join(tempdir, "test.txt") >>> with open(path, "w") as testFile: @@ -790,7 +793,7 @@ class SparkContext(object): >>> sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect() [100, 200, 300, 400] """ -self._jsc.sc().addFile(path) +self._jsc.sc().addFile(path, recursive) def addPyFile(self, path): """ http://git-wip-us.apache.org/repos/asf/spark/blob/d3b88697/python/pyspark/tests.py -- diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py index 0a029b6..b075691 100644 --- a/python/pyspark/tests.py +++ b/python/pyspark/tests.py @@ -409,13 +409,23 @@ class AddFileTests(PySparkTestCase): self.assertEqual("Hello World!"
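For reference, the Scala-side call that this change wires the Python and Java APIs up to takes the same `recursive` flag. A usage sketch, assuming an existing `SparkContext` named `sc` and a hypothetical HDFS directory:

```
import org.apache.spark.SparkFiles

// Hypothetical directory; directories require recursive = true and are currently
// only supported for Hadoop-supported filesystems.
sc.addFile("hdfs:///apps/myjob/config-dir", true)

// On executors, SparkFiles resolves the local copy of the shipped directory;
// files inside it keep their relative layout.
val confDir = SparkFiles.get("config-dir")
```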
spark git commit: [SPARK-17595][MLLIB] Use a bounded priority queue to find synonyms in Word2VecModel
Repository: spark Updated Branches: refs/heads/master d3b886976 -> 7654385f2 [SPARK-17595][MLLIB] Use a bounded priority queue to find synonyms in Word2VecModel ## What changes were proposed in this pull request? The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements with the highest similarity to the query vector currently sorts the collection of similarities for every vocabulary element. This involves making multiple copies of the collection of similarities while doing a (relatively) expensive sort. It would be more efficient to find the best matches by maintaining a bounded priority queue and populating it with a single pass over the vocabulary, and that is exactly what this patch does. ## How was this patch tested? This patch adds no user-visible functionality and its correctness should be exercised by existing tests. To ensure that this approach is actually faster, I made a microbenchmark for `findSynonyms`: ``` object W2VTiming { import org.apache.spark.{SparkContext, SparkConf} import org.apache.spark.mllib.feature.Word2VecModel def run(modelPath: String, scOpt: Option[SparkContext] = None) { val sc = scOpt.getOrElse(new SparkContext(new SparkConf(true).setMaster("local[*]").setAppName("test"))) val model = Word2VecModel.load(sc, modelPath) val keys = model.getVectors.keys val start = System.currentTimeMillis for(key <- keys) { model.findSynonyms(key, 5) model.findSynonyms(key, 10) model.findSynonyms(key, 25) model.findSynonyms(key, 50) } val finish = System.currentTimeMillis println("run completed in " + (finish - start) + "ms") } } ``` I ran this test on a model generated from the complete works of Jane Austen and found that the new approach was over 3x faster than the old approach. (If the `num` argument to `findSynonyms` is very close to the vocabulary size, the new approach will have less of an advantage over the old one.) Author: William Benton Closes #15150 from willb/SPARK-17595. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7654385f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7654385f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7654385f Branch: refs/heads/master Commit: 7654385f268a3f481c4574ce47a19ab21155efd5 Parents: d3b8869 Author: William Benton Authored: Wed Sep 21 09:45:06 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 09:45:06 2016 +0100 -- .../org/apache/spark/mllib/feature/Word2Vec.scala | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7654385f/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala index 42ca966..2364d43 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala @@ -35,6 +35,7 @@ import org.apache.spark.mllib.linalg.{Vector, Vectors} import org.apache.spark.mllib.util.{Loader, Saveable} import org.apache.spark.rdd._ import org.apache.spark.sql.SparkSession +import org.apache.spark.util.BoundedPriorityQueue import org.apache.spark.util.Utils import org.apache.spark.util.random.XORShiftRandom @@ -555,7 +556,7 @@ class Word2VecModel private[spark] ( num: Int, wordOpt: Option[String]): Array[(String, Double)] = { require(num > 0, "Number of similar words should > 0") -// TODO: optimize top-k + val fVector = vector.toArray.map(_.toFloat) val cosineVec = Array.fill[Float](numWords)(0) val alpha: Float = 1 @@ -580,10 +581,16 @@ class Word2VecModel private[spark] ( ind += 1 } -val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2) +val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2)) + +for(i <- cosVec.indices) { + pq += Tuple2(wordList(i), cosVec(i)) +} + +val scored = pq.toSeq.sortBy(-_._2) val filtered = wordOpt match { - case Some(w) => scored.take(num + 1).filter(tup => w != tup._1) + case Some(w) => scored.filter(tup => w != tup._1) case None => scored } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
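`BoundedPriorityQueue` is a Spark-internal (`private[spark]`) helper, but the single-pass top-k idea it implements can be sketched with a plain Scala priority queue that evicts its lowest-scoring element once it holds more than `k` entries (a sketch, not the patch's code):

```
import scala.collection.mutable

// Sketch: keep only the k highest-scoring (word, score) pairs in a single pass,
// instead of sorting the whole vocabulary.
def topK(scored: Iterator[(String, Double)], k: Int): Seq[(String, Double)] = {
  // The reversed ordering makes this a min-heap on the score, so dequeue()
  // evicts the lowest-scoring retained pair.
  val queue = mutable.PriorityQueue.empty[(String, Double)](
    Ordering.by[(String, Double), Double](_._2).reverse)
  scored.foreach { pair =>
    queue.enqueue(pair)
    if (queue.size > k) queue.dequeue()
  }
  queue.toSeq.sortBy(-_._2)
}
```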
spark git commit: [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value
Repository: spark Updated Branches: refs/heads/master 7654385f2 -> 3977223a3 [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value ## What changes were proposed in this pull request? Remainder(%) expression's `eval()` returns incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to do "%", and that lose precision. This bug only affects the `eval()` that is used by constant folding, the codegen path is not impacted. ### Before change ``` scala> -5083676433652386516D % 10 res2: Double = -6.0 scala> spark.sql("select -5083676433652386516D % 10 as a").show +---+ | a| +---+ |0.0| +---+ ``` ### After change ``` scala> spark.sql("select -5083676433652386516D % 10 as a").show ++ | a| ++ |-6.0| ++ ``` ## How was this patch tested? Unit test. Author: Sean Zhong Closes #15171 from clockfly/SPARK-17617. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3977223a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3977223a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3977223a Branch: refs/heads/master Commit: 3977223a3268aaf6913a325ee459139a4a302b1c Parents: 7654385 Author: Sean Zhong Authored: Wed Sep 21 16:53:34 2016 +0800 Committer: Wenchen Fan Committed: Wed Sep 21 16:53:34 2016 +0800 -- .../spark/sql/catalyst/expressions/arithmetic.scala | 6 +- .../catalyst/expressions/ArithmeticExpressionSuite.scala | 11 +++ 2 files changed, 16 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3977223a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala index 13e539a..6f3db79 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala @@ -310,7 +310,11 @@ case class Remainder(left: Expression, right: Expression) if (input1 == null) { null } else { -integral.rem(input1, input2) +input1 match { + case d: Double => d % input2.asInstanceOf[java.lang.Double] + case f: Float => f % input2.asInstanceOf[java.lang.Float] + case _ => integral.rem(input1, input2) +} } } } http://git-wip-us.apache.org/repos/asf/spark/blob/3977223a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala index 5c98242..0d86efd 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala @@ -175,6 +175,17 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper } } + test("SPARK-17617: % (Remainder) double % double on super big double") { +val leftDouble = Literal(-5083676433652386516D) +val rightDouble = Literal(10D) +checkEvaluation(Remainder(leftDouble, rightDouble), -6.0D) + +// Float has smaller precision +val leftFloat = Literal(-5083676433652386516F) +val rightFloat = Literal(10F) +checkEvaluation(Remainder(leftFloat, 
rightFloat), -2.0F) + } + test("Abs") { testNumericDataTypes { convert => val input = Literal(convert(1)) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
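The precision loss is reproducible without Spark: converting such a large `Double` to a decimal goes through its string form, which rounds away the low digits, so the remainder collapses to zero. A standalone sketch (the inline results match the before/after output shown above):

```
val dividend = -5083676433652386516D   // not exactly representable as a Double

// Direct floating-point remainder, which is what codegen (and eval() after this fix) computes:
println(dividend % 10)                                             // -6.0

// Routing the value through a decimal representation first (roughly what the old
// eval() path did via BigDecimal) rounds it and loses the remainder:
val asDecimal = java.math.BigDecimal.valueOf(dividend)
println(asDecimal.remainder(java.math.BigDecimal.TEN).doubleValue) // 0.0
```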
spark git commit: [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value
Repository: spark Updated Branches: refs/heads/branch-2.0 726f05716 -> 65295bad9 [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value ## What changes were proposed in this pull request? Remainder(%) expression's `eval()` returns incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to do "%", and that lose precision. This bug only affects the `eval()` that is used by constant folding, the codegen path is not impacted. ### Before change ``` scala> -5083676433652386516D % 10 res2: Double = -6.0 scala> spark.sql("select -5083676433652386516D % 10 as a").show +---+ | a| +---+ |0.0| +---+ ``` ### After change ``` scala> spark.sql("select -5083676433652386516D % 10 as a").show ++ | a| ++ |-6.0| ++ ``` ## How was this patch tested? Unit test. Author: Sean Zhong Closes #15171 from clockfly/SPARK-17617. (cherry picked from commit 3977223a3268aaf6913a325ee459139a4a302b1c) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/65295bad Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/65295bad Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/65295bad Branch: refs/heads/branch-2.0 Commit: 65295bad9b2a81f2394a52eedeb31da0ed2c4847 Parents: 726f057 Author: Sean Zhong Authored: Wed Sep 21 16:53:34 2016 +0800 Committer: Wenchen Fan Committed: Wed Sep 21 16:53:55 2016 +0800 -- .../spark/sql/catalyst/expressions/arithmetic.scala | 6 +- .../catalyst/expressions/ArithmeticExpressionSuite.scala | 11 +++ 2 files changed, 16 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/65295bad/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala index fa459aa..01c5d82 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala @@ -309,7 +309,11 @@ case class Remainder(left: Expression, right: Expression) if (input1 == null) { null } else { -integral.rem(input1, input2) +input1 match { + case d: Double => d % input2.asInstanceOf[java.lang.Double] + case f: Float => f % input2.asInstanceOf[java.lang.Float] + case _ => integral.rem(input1, input2) +} } } } http://git-wip-us.apache.org/repos/asf/spark/blob/65295bad/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala index 2e37887..069c3b3 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala @@ -173,6 +173,17 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper // } } + test("SPARK-17617: % (Remainder) double % double on super big double") { +val leftDouble = Literal(-5083676433652386516D) +val rightDouble = Literal(10D) +checkEvaluation(Remainder(leftDouble, rightDouble), -6.0D) + +// Float has smaller precision +val leftFloat = 
Literal(-5083676433652386516F) +val rightFloat = Literal(10F) +checkEvaluation(Remainder(leftFloat, rightFloat), -2.0F) + } + test("Abs") { testNumericDataTypes { convert => val input = Literal(convert(1)) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value
Repository: spark Updated Branches: refs/heads/branch-1.6 8646b84fb -> 8f88412c3 [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value ## What changes were proposed in this pull request? Remainder(%) expression's `eval()` returns incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to do "%", and that lose precision. This bug only affects the `eval()` that is used by constant folding, the codegen path is not impacted. ### Before change ``` scala> -5083676433652386516D % 10 res2: Double = -6.0 scala> spark.sql("select -5083676433652386516D % 10 as a").show +---+ | a| +---+ |0.0| +---+ ``` ### After change ``` scala> spark.sql("select -5083676433652386516D % 10 as a").show ++ | a| ++ |-6.0| ++ ``` ## How was this patch tested? Unit test. Author: Sean Zhong Closes #15171 from clockfly/SPARK-17617. (cherry picked from commit 3977223a3268aaf6913a325ee459139a4a302b1c) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8f88412c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8f88412c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8f88412c Branch: refs/heads/branch-1.6 Commit: 8f88412c31dc840c15df9822638645381c82a2fe Parents: 8646b84 Author: Sean Zhong Authored: Wed Sep 21 16:53:34 2016 +0800 Committer: Wenchen Fan Committed: Wed Sep 21 16:57:44 2016 +0800 -- .../spark/sql/catalyst/expressions/arithmetic.scala | 6 +- .../catalyst/expressions/ArithmeticExpressionSuite.scala | 11 +++ 2 files changed, 16 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8f88412c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala index cfae285..7905825 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala @@ -273,7 +273,11 @@ case class Remainder(left: Expression, right: Expression) extends BinaryArithmet if (input1 == null) { null } else { -integral.rem(input1, input2) +input1 match { + case d: Double => d % input2.asInstanceOf[java.lang.Double] + case f: Float => f % input2.asInstanceOf[java.lang.Float] + case _ => integral.rem(input1, input2) +} } } } http://git-wip-us.apache.org/repos/asf/spark/blob/8f88412c/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala index 72285c6..a5930d4 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala @@ -172,6 +172,17 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper // } } + test("SPARK-17617: % (Remainder) double % double on super big double") { +val leftDouble = Literal(-5083676433652386516D) +val rightDouble = Literal(10D) +checkEvaluation(Remainder(leftDouble, rightDouble), -6.0D) + +// Float has smaller 
precision +val leftFloat = Literal(-5083676433652386516F) +val rightFloat = Literal(10F) +checkEvaluation(Remainder(leftFloat, rightFloat), -2.0F) + } + test("Abs") { testNumericDataTypes { convert => val input = Literal(convert(1)) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-17599] Prevent ListingFileCatalog from failing if path doesn't exist
Repository: spark Updated Branches: refs/heads/master 3977223a3 -> 28fafa3ee [SPARK-17599] Prevent ListingFileCatalog from failing if path doesn't exist ## What changes were proposed in this pull request? The `ListingFileCatalog` lists files given a set of resolved paths. If a folder is deleted at any time between the paths were resolved and the file catalog can check for the folder, the Spark job fails. This may abruptly stop long running StructuredStreaming jobs for example. Folders may be deleted by users or automatically by retention policies. These cases should not prevent jobs from successfully completing. ## How was this patch tested? Unit test in `FileCatalogSuite` Author: Burak Yavuz Closes #15153 from brkyvz/SPARK-17599. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/28fafa3e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/28fafa3e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/28fafa3e Branch: refs/heads/master Commit: 28fafa3ee8f3478fa441e7bd6c8fd4ab482ca98e Parents: 3977223 Author: Burak Yavuz Authored: Wed Sep 21 17:07:16 2016 +0800 Committer: Wenchen Fan Committed: Wed Sep 21 17:07:16 2016 +0800 -- .../sql/execution/datasources/ListingFileCatalog.scala | 12 ++-- .../sql/execution/datasources/FileCatalogSuite.scala| 11 +++ 2 files changed, 21 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/28fafa3e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala index 60742bd..3253208 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala @@ -17,6 +17,8 @@ package org.apache.spark.sql.execution.datasources +import java.io.FileNotFoundException + import scala.collection.mutable import org.apache.hadoop.fs.{FileStatus, LocatedFileStatus, Path} @@ -97,8 +99,14 @@ class ListingFileCatalog( logTrace(s"Listing $path on driver") val childStatuses = { - val stats = fs.listStatus(path) - if (pathFilter != null) stats.filter(f => pathFilter.accept(f.getPath)) else stats + try { +val stats = fs.listStatus(path) +if (pathFilter != null) stats.filter(f => pathFilter.accept(f.getPath)) else stats + } catch { +case _: FileNotFoundException => + logWarning(s"The directory $path was not found. 
Was it deleted very recently?") + Array.empty[FileStatus] + } } childStatuses.map { http://git-wip-us.apache.org/repos/asf/spark/blob/28fafa3e/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala index 0d9ea51..5c8d322 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala @@ -67,4 +67,15 @@ class FileCatalogSuite extends SharedSQLContext { } } + + test("ListingFileCatalog: folders that don't exist don't throw exceptions") { +withTempDir { dir => + val deletedFolder = new File(dir, "deleted") + assert(!deletedFolder.exists()) + val catalog1 = new ListingFileCatalog( +spark, Seq(new Path(deletedFolder.getCanonicalPath)), Map.empty, None) + // doesn't throw an exception + assert(catalog1.listLeafFiles(catalog1.paths).isEmpty) +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
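The same defensive-listing pattern can be applied on its own when listing Hadoop paths that may disappear concurrently. A sketch with hypothetical names, not the patch's code:

```
import java.io.FileNotFoundException

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Sketch: list a directory, treating a folder deleted between path resolution and
// listing (for example by a retention policy) as simply empty.
def listOrEmpty(path: Path, hadoopConf: Configuration): Array[FileStatus] = {
  val fs = path.getFileSystem(hadoopConf)
  try {
    fs.listStatus(path)
  } catch {
    case _: FileNotFoundException => Array.empty[FileStatus]
  }
}
```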
spark git commit: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate (FPR) test
Repository: spark Updated Branches: refs/heads/master 28fafa3ee -> b366f1849 [SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate (FPR) test ## What changes were proposed in this pull request? Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular univariate statistical test for feature selection. We add a chiSquare Selector based on False Positive Rate (FPR) test in this PR, like it is implemented in scikit-learn. http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection ## How was this patch tested? Add Scala ut Author: Peng, Meng Closes #14597 from mpjlu/fprChiSquare. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b366f184 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b366f184 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b366f184 Branch: refs/heads/master Commit: b366f18496e1ce8bd20fe58a0245ef7d91819a03 Parents: 28fafa3 Author: Peng, Meng Authored: Wed Sep 21 10:17:38 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 10:17:38 2016 +0100 -- .../apache/spark/ml/feature/ChiSqSelector.scala | 69 - .../spark/mllib/api/python/PythonMLLibAPI.scala | 28 - .../spark/mllib/feature/ChiSqSelector.scala | 103 ++- .../spark/ml/feature/ChiSqSelectorSuite.scala | 11 +- .../mllib/feature/ChiSqSelectorSuite.scala | 18 project/MimaExcludes.scala | 3 + python/pyspark/mllib/feature.py | 71 - 7 files changed, 262 insertions(+), 41 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b366f184/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala index 1482eb3..0c6a37b 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala @@ -27,6 +27,7 @@ import org.apache.spark.ml.param._ import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util._ import org.apache.spark.mllib.feature +import org.apache.spark.mllib.feature.ChiSqSelectorType import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} import org.apache.spark.rdd.RDD @@ -54,11 +55,47 @@ private[feature] trait ChiSqSelectorParams extends Params /** @group getParam */ def getNumTopFeatures: Int = $(numTopFeatures) + + final val percentile = new DoubleParam(this, "percentile", +"Percentile of features that selector will select, ordered by statistics value descending.", +ParamValidators.inRange(0, 1)) + setDefault(percentile -> 0.1) + + /** @group getParam */ + def getPercentile: Double = $(percentile) + + final val alpha = new DoubleParam(this, "alpha", +"The highest p-value for features to be kept.", +ParamValidators.inRange(0, 1)) + setDefault(alpha -> 0.05) + + /** @group getParam */ + def getAlpha: Double = $(alpha) + + /** + * The ChiSqSelector supports KBest, Percentile, FPR selection, + * which is the same as ChiSqSelectorType defined in MLLIB. 
+ * when call setNumTopFeatures, the selectorType is set to KBest + * when call setPercentile, the selectorType is set to Percentile + * when call setAlpha, the selectorType is set to FPR + */ + final val selectorType = new Param[String](this, "selectorType", +"ChiSqSelector Type: KBest, Percentile, FPR") + setDefault(selectorType -> ChiSqSelectorType.KBest.toString) + + /** @group getParam */ + def getChiSqSelectorType: String = $(selectorType) } /** * Chi-Squared feature selection, which selects categorical features to use for predicting a * categorical label. + * The selector supports three selection methods: `KBest`, `Percentile` and `FPR`. + * `KBest` chooses the `k` top features according to a chi-squared test. + * `Percentile` is similar but chooses a fraction of all features instead of a fixed number. + * `FPR` chooses all features whose false positive rate meets some threshold. + * By default, the selection method is `KBest`, the default number of top features is 50. + * User can use setNumTopFeatures, setPercentile and setAlpha to set different selection methods. */ @Since("1.6.0") final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: String) @@ -69,7 +106,22 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str /** @group setParam */ @Sin
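A usage sketch of the selector on the ML side, with a tiny illustrative dataset; it assumes an active `SparkSession` named `spark`, and `setPercentile`/`setAlpha` are the new entry points described in the scaladoc above:

```
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

// Assumes an active SparkSession named `spark`; the data is purely illustrative.
val df = spark.createDataFrame(Seq(
  (Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)).toDF("features", "label")

val selector = new ChiSqSelector()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")
  .setNumTopFeatures(1)        // KBest (the default method): keep the single best feature
// .setPercentile(0.5)         // Percentile: keep the top 50% of features
// .setAlpha(0.05)             // FPR (added here): keep features whose p-value is below 0.05

selector.fit(df).transform(df).show()
```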
spark git commit: [SPARK-17219][ML] Add NaN value handling in Bucketizer
Repository: spark Updated Branches: refs/heads/master b366f1849 -> 57dc326bd [SPARK-17219][ML] Add NaN value handling in Bucketizer ## What changes were proposed in this pull request? This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value. Sometimes, null value might also be useful to users, so in these cases, Bucketizer should reserve one extra bucket for NaN values, instead of throwing an illegal exception. Before: ``` Bucketizer.transform on NaN value threw an illegal exception. ``` After: ``` NaN values will be grouped in an extra bucket. ``` ## How was this patch tested? New test cases added in `BucketizerSuite`. Signed-off-by: VinceShieh Author: VinceShieh Closes #14858 from VinceShieh/spark-17219. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/57dc326b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/57dc326b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/57dc326b Branch: refs/heads/master Commit: 57dc326bd00cf0a49da971e9c573c48ae28acaa2 Parents: b366f18 Author: VinceShieh Authored: Wed Sep 21 10:20:57 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 10:20:57 2016 +0100 -- docs/ml-features.md | 6 +++- .../apache/spark/ml/feature/Bucketizer.scala| 13 +--- .../spark/ml/feature/QuantileDiscretizer.scala | 9 -- .../spark/ml/feature/BucketizerSuite.scala | 31 .../ml/feature/QuantileDiscretizerSuite.scala | 29 +++--- python/pyspark/ml/feature.py| 5 .../spark/sql/DataFrameStatFunctions.scala | 4 ++- 7 files changed, 85 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/57dc326b/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 746593f..a39b31c 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1102,7 +1102,11 @@ for more details on the API. ## QuantileDiscretizer `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned -categorical features. The number of bins is set by the `numBuckets` parameter. +categorical features. The number of bins is set by the `numBuckets` parameter. It is possible +that the number of buckets used will be less than this value, for example, if there are too few +distinct values of the input to create enough distinct quantiles. Note also that NaN values are +handled specially and placed into their own bucket. For example, if 4 buckets are used, then +non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4]. The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions) for a detailed description). The precision of the approximation can be controlled with the http://git-wip-us.apache.org/repos/asf/spark/blob/57dc326b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala index 100d9e7..ec0ea05 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala @@ -106,7 +106,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String @Since("1.6.0") object Bucketizer extends DefaultParamsReadable[Bucketizer] { - /** We require splits to be of length >= 3 and to be in strictly increasing order. 
*/ + /** + * We require splits to be of length >= 3 and to be in strictly increasing order. + * No NaN split should be accepted. + */ private[feature] def checkSplits(splits: Array[Double]): Boolean = { if (splits.length < 3) { false @@ -114,10 +117,10 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] { var i = 0 val n = splits.length - 1 while (i < n) { -if (splits(i) >= splits(i + 1)) return false +if (splits(i) >= splits(i + 1) || splits(i).isNaN) return false i += 1 } - true + !splits(n).isNaN } } @@ -126,7 +129,9 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] { * @throws SparkException if a feature is < splits.head or > splits.last */ private[feature] def binarySearchForBuckets(splits: Array[Double], feature: Double): Double = { -if (feature == splits.last) { +if (feature.isNaN) { + splits.length - 1 +} else if (feature == splits.last) { splits.length - 2
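A small sketch of the new behaviour, assuming an active `SparkSession` named `spark`: non-NaN values are binned as before, while the NaN row is assigned one extra bucket instead of failing the transform.

```
import org.apache.spark.ml.feature.Bucketizer

// Assumes an active SparkSession named `spark`.
val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)
val data = Array(-0.9, -0.2, 0.1, 0.8, Double.NaN)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)

// Four regular buckets (indices 0 to 3); with this patch the NaN row is placed in the
// extra bucket with index 4 instead of causing an exception.
bucketizer.transform(df).show()
```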
spark git commit: [SPARK-17583][SQL] Remove useless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV
Repository: spark Updated Branches: refs/heads/master 57dc326bd -> 25a020be9 [SPARK-17583][SQL] Remove uesless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV ## What changes were proposed in this pull request? This PR includes the changes below: 1. Upgrade Univocity library from 2.1.1 to 2.2.1 This includes some performance improvement and also enabling auto-extending buffer in `maxCharsPerColumn` option in CSV. Please refer the [release notes](https://github.com/uniVocity/univocity-parsers/releases). 2. Remove useless `rowSeparator` variable existing in `CSVOptions` We have this unused variable in [CSVOptions.scala#L127](https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127) but it seems possibly causing confusion that it actually does not care of `\r\n`. For example, we have an issue open about this, [SPARK-17227](https://issues.apache.org/jira/browse/SPARK-17227), describing this variable. This variable is virtually not being used because we rely on `LineRecordReader` in Hadoop which deals with only both `\n` and `\r\n`. 3. Set the default value of `maxCharsPerColumn` to auto-expending. We are setting 100 for the length of each column. It'd be more sensible we allow auto-expending rather than fixed length by default. To make sure, using `-1` is being described in the release note, [2.2.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.2.0). ## How was this patch tested? N/A Author: hyukjinkwon Closes #15138 from HyukjinKwon/SPARK-17583. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/25a020be Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/25a020be Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/25a020be Branch: refs/heads/master Commit: 25a020be99b6a540e4001e59e40d5d1c8aa53812 Parents: 57dc326 Author: hyukjinkwon Authored: Wed Sep 21 10:35:29 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 10:35:29 2016 +0100 -- dev/deps/spark-deps-hadoop-2.2 | 2 +- dev/deps/spark-deps-hadoop-2.3 | 2 +- dev/deps/spark-deps-hadoop-2.4 | 2 +- dev/deps/spark-deps-hadoop-2.6 | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- python/pyspark/sql/readwriter.py | 2 +- python/pyspark/sql/streaming.py | 2 +- sql/core/pom.xml | 2 +- .../src/main/scala/org/apache/spark/sql/DataFrameReader.scala| 4 ++-- .../apache/spark/sql/execution/datasources/csv/CSVOptions.scala | 4 +--- .../apache/spark/sql/execution/datasources/csv/CSVParser.scala | 2 -- .../scala/org/apache/spark/sql/streaming/DataStreamReader.scala | 4 ++-- 12 files changed, 13 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/25a020be/dev/deps/spark-deps-hadoop-2.2 -- diff --git a/dev/deps/spark-deps-hadoop-2.2 b/dev/deps/spark-deps-hadoop-2.2 index a7259e2..f4f92c6 100644 --- a/dev/deps/spark-deps-hadoop-2.2 +++ b/dev/deps/spark-deps-hadoop-2.2 @@ -159,7 +159,7 @@ stax-api-1.0.1.jar stream-2.7.0.jar stringtemplate-3.2.1.jar super-csv-2.2.0.jar -univocity-parsers-2.1.1.jar +univocity-parsers-2.2.1.jar validation-api-1.1.0.Final.jar xbean-asm5-shaded-4.4.jar xmlenc-0.52.jar http://git-wip-us.apache.org/repos/asf/spark/blob/25a020be/dev/deps/spark-deps-hadoop-2.3 -- diff --git a/dev/deps/spark-deps-hadoop-2.3 b/dev/deps/spark-deps-hadoop-2.3 index 6986ab5..3db013f 100644 --- a/dev/deps/spark-deps-hadoop-2.3 +++ b/dev/deps/spark-deps-hadoop-2.3 @@ -167,7 
+167,7 @@ stax-api-1.0.1.jar stream-2.7.0.jar stringtemplate-3.2.1.jar super-csv-2.2.0.jar -univocity-parsers-2.1.1.jar +univocity-parsers-2.2.1.jar validation-api-1.1.0.Final.jar xbean-asm5-shaded-4.4.jar xmlenc-0.52.jar http://git-wip-us.apache.org/repos/asf/spark/blob/25a020be/dev/deps/spark-deps-hadoop-2.4 -- diff --git a/dev/deps/spark-deps-hadoop-2.4 b/dev/deps/spark-deps-hadoop-2.4 index 75cccb3..7171010 100644 --- a/dev/deps/spark-deps-hadoop-2.4 +++ b/dev/deps/spark-deps-hadoop-2.4 @@ -167,7 +167,7 @@ stax-api-1.0.1.jar stream-2.7.0.jar stringtemplate-3.2.1.jar super-csv-2.2.0.jar -univocity-parsers-2.1.1.jar +univocity-parsers-2.2.1.jar validation-api-1.1.0.Final.jar xbean-asm5-shaded-4.4.jar xmlenc-0.52.jar http
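A read sketch with a hypothetical path, assuming a `SparkSession` named `spark`: with the new default the option only needs to be set when an explicit per-column cap is wanted.

```
// maxCharsPerColumn now defaults to -1 (auto-expanding buffer), so it only needs to be
// set when an explicit per-column limit is desired.
val df = spark.read
  .option("header", "true")
  .option("maxCharsPerColumn", "100000")   // optional cap; omit it to rely on the new default
  .csv("/path/to/data.csv")                // hypothetical path
```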
spark git commit: [CORE][MINOR] Add minor code change to TaskState and Task
Repository: spark Updated Branches: refs/heads/master 25a020be9 -> dd7561d33 [CORE][MINOR] Add minor code change to TaskState and Task ## What changes were proposed in this pull request? - TaskState and ExecutorState expose isFailed and isFinished functions. It can be useful to add test coverage for different states. Currently, Other enums do not expose any functions so this PR aims just these two enums. - `private` access modifier is added for Finished Task States Set - A minor doc change is added. ## How was this patch tested? New Unit tests are added and run locally. Author: erenavsarogullari Closes #15143 from erenavsarogullari/SPARK-17584. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dd7561d3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dd7561d3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dd7561d3 Branch: refs/heads/master Commit: dd7561d33761d119ded09cfba072147292bf6964 Parents: 25a020b Author: erenavsarogullari Authored: Wed Sep 21 14:47:18 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 14:47:18 2016 +0100 -- core/src/main/scala/org/apache/spark/TaskState.scala | 2 +- core/src/main/scala/org/apache/spark/scheduler/Task.scala | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dd7561d3/core/src/main/scala/org/apache/spark/TaskState.scala -- diff --git a/core/src/main/scala/org/apache/spark/TaskState.scala b/core/src/main/scala/org/apache/spark/TaskState.scala index cbace7b..596ce67 100644 --- a/core/src/main/scala/org/apache/spark/TaskState.scala +++ b/core/src/main/scala/org/apache/spark/TaskState.scala @@ -21,7 +21,7 @@ private[spark] object TaskState extends Enumeration { val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value - val FINISHED_STATES = Set(FINISHED, FAILED, KILLED, LOST) + private val FINISHED_STATES = Set(FINISHED, FAILED, KILLED, LOST) type TaskState = Value http://git-wip-us.apache.org/repos/asf/spark/blob/dd7561d3/core/src/main/scala/org/apache/spark/scheduler/Task.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/Task.scala b/core/src/main/scala/org/apache/spark/scheduler/Task.scala index 1ed36bf..ea9dc39 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/Task.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/Task.scala @@ -239,7 +239,7 @@ private[spark] object Task { * and return the task itself as a serialized ByteBuffer. The caller can then update its * ClassLoaders and deserialize the task. * - * @return (taskFiles, taskJars, taskBytes) + * @return (taskFiles, taskJars, taskProps, taskBytes) */ def deserializeWithDependencies(serializedTask: ByteBuffer) : (HashMap[String, Long], HashMap[String, Long], Properties, ByteBuffer) = { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
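For context, the `isFailed`/`isFinished` helpers mentioned above are not visible in the diff hunk; after this change the enumeration looks roughly like this:

```
package org.apache.spark

private[spark] object TaskState extends Enumeration {

  val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value

  // Terminal states; private after this change, so callers go through isFinished.
  private val FINISHED_STATES = Set(FINISHED, FAILED, KILLED, LOST)

  type TaskState = Value

  def isFailed(state: TaskState): Boolean = (LOST == state) || (FAILED == state)

  def isFinished(state: TaskState): Boolean = FINISHED_STATES.contains(state)
}
```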
spark git commit: [SPARK-17590][SQL] Analyze CTE definitions at once and allow CTE subquery to define CTE
Repository: spark Updated Branches: refs/heads/master dd7561d33 -> 248922fd4 [SPARK-17590][SQL] Analyze CTE definitions at once and allow CTE subquery to define CTE ## What changes were proposed in this pull request? We substitute logical plan with CTE definitions in the analyzer rule CTESubstitution. A CTE definition can be used in the logical plan for multiple times, and its analyzed logical plan should be the same. We should not analyze CTE definitions multiple times when they are reused in the query. By analyzing CTE definitions before substitution, we can support defining CTE in subquery. ## How was this patch tested? Jenkins tests. Author: Liang-Chi Hsieh Closes #15146 from viirya/cte-analysis-once. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/248922fd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/248922fd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/248922fd Branch: refs/heads/master Commit: 248922fd4fb7c11a40304431e8cc667a8911a906 Parents: dd7561d Author: Liang-Chi Hsieh Authored: Wed Sep 21 06:53:42 2016 -0700 Committer: Herman van Hovell Committed: Wed Sep 21 06:53:42 2016 -0700 -- .../apache/spark/sql/catalyst/parser/SqlBase.g4 | 2 +- .../spark/sql/catalyst/analysis/Analyzer.scala | 5 ++-- .../spark/sql/catalyst/parser/AstBuilder.scala | 2 +- .../org/apache/spark/sql/SubquerySuite.scala| 25 4 files changed, 29 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/248922fd/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 -- diff --git a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 index 7023c0c..de2f9ee 100644 --- a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 +++ b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 @@ -262,7 +262,7 @@ ctes ; namedQuery -: name=identifier AS? '(' queryNoWith ')' +: name=identifier AS? '(' query ')' ; tableProvider http://git-wip-us.apache.org/repos/asf/spark/blob/248922fd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index cc62d5e..ae8869f 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -116,15 +116,14 @@ class Analyzer( ) /** - * Substitute child plan with cte definitions + * Analyze cte definitions and substitute child plan with analyzed cte definitions. 
*/ object CTESubstitution extends Rule[LogicalPlan] { -// TODO allow subquery to define CTE def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { case With(child, relations) => substituteCTE(child, relations.foldLeft(Seq.empty[(String, LogicalPlan)]) { case (resolved, (name, relation)) => -resolved :+ name -> ResolveRelations(substituteCTE(relation, resolved)) +resolved :+ name -> execute(substituteCTE(relation, resolved)) }) case other => other } http://git-wip-us.apache.org/repos/asf/spark/blob/248922fd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala index 69d68fa..12a70b7 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala @@ -108,7 +108,7 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with Logging { * This is only used for Common Table Expressions. */ override def visitNamedQuery(ctx: NamedQueryContext): SubqueryAlias = withOrigin(ctx) { -SubqueryAlias(ctx.name.getText, plan(ctx.queryNoWith), None) +SubqueryAlias(ctx.name.getText, plan(ctx.query), None) } /** http://git-wip-us.apache.org/repos/asf/spark/blob/248922fd/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySui
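Because `namedQuery` now accepts a full `query` rather than `queryNoWith`, a CTE definition may itself contain a `WITH` clause. A sketch, assuming a `SparkSession` named `spark` and a table `t` with an integer column `a`:

```
// A CTE whose definition itself declares a CTE, which was not supported before this change.
spark.sql("""
  WITH outer_cte AS (
    WITH inner_cte AS (SELECT a FROM t)
    SELECT a + 1 AS b FROM inner_cte
  )
  SELECT b FROM outer_cte
""").show()
```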
spark git commit: [BACKPORT 2.0][MINOR][BUILD] Fix CheckStyle Error
Repository: spark Updated Branches: refs/heads/branch-2.0 65295bad9 -> 45bccdd9c [BACKPORT 2.0][MINOR][BUILD] Fix CheckStyle Error ## What changes were proposed in this pull request? This PR is to fix the code style errors. ## How was this patch tested? Manual. Before: ``` ./dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[153] (sizes) LineLength: Line is longer than 100 characters (found 107). [ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[196] (sizes) LineLength: Line is longer than 100 characters (found 108). [ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[239] (sizes) LineLength: Line is longer than 100 characters (found 115). [ERROR] src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[119] (sizes) LineLength: Line is longer than 100 characters (found 107). [ERROR] src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[129] (sizes) LineLength: Line is longer than 100 characters (found 104). [ERROR] src/main/java/org/apache/spark/network/util/LevelDBProvider.java:[124,11] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions. [ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[184] (regexp) RegexpSingleline: No trailing whitespace allowed. [ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[304] (regexp) RegexpSingleline: No trailing whitespace allowed. ``` After: ``` ./dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks passed. ``` Author: Weiqing Yang Closes #15175 from Sherry302/javastylefix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/45bccdd9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/45bccdd9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/45bccdd9 Branch: refs/heads/branch-2.0 Commit: 45bccdd9c2b180323958db0f92ca8ee591e502ef Parents: 65295ba Author: Weiqing Yang Authored: Wed Sep 21 15:18:02 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 15:18:02 2016 +0100 -- .../org/apache/spark/network/client/TransportClient.java | 11 ++- .../spark/network/server/TransportRequestHandler.java| 7 --- .../org/apache/spark/network/util/LevelDBProvider.java | 2 +- .../apache/spark/network/yarn/YarnShuffleService.java| 4 ++-- 4 files changed, 13 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/45bccdd9/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java -- diff --git a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java index a67683b..17ac91d 100644 --- a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java +++ b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java @@ -150,8 +150,8 @@ public class TransportClient implements Closeable { if (future.isSuccess()) { long timeTaken = System.currentTimeMillis() - startTime; if (logger.isTraceEnabled()) { - logger.trace("Sending request {} to {} took {} ms", streamChunkId, getRemoteAddress(channel), -timeTaken); + logger.trace("Sending request {} to {} took {} ms", streamChunkId, +getRemoteAddress(channel), timeTaken); } } else { String errorMsg = 
String.format("Failed to send request %s to %s: %s", streamChunkId, @@ -193,8 +193,8 @@ public class TransportClient implements Closeable { if (future.isSuccess()) { long timeTaken = System.currentTimeMillis() - startTime; if (logger.isTraceEnabled()) { -logger.trace("Sending request for {} to {} took {} ms", streamId, getRemoteAddress(channel), - timeTaken); +logger.trace("Sending request for {} to {} took {} ms", streamId, + getRemoteAddress(channel), timeTaken); } } else { String errorMsg = String.format("Failed to send request for %s to %s: %s", streamId, @@ -236,7 +236,8 @@ public class TransportClient implements Closeable { if (future.isSuccess()) { long timeTaken = System.currentTimeMillis() - startTime; if (logger.isTraceEnabled()) { - logger.trace("Sending request {} to {} took {} ms", requestId, getRemoteAddress(channel
spark git commit: [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published
Repository: spark Updated Branches: refs/heads/master 248922fd4 -> d7ee12211 [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published This patch updates the `kinesis-asl-assembly` build to prevent that module from being published as part of Maven releases and snapshot builds. The `kinesis-asl-assembly` includes classes from the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), both of which are licensed under the Amazon Software License and are therefore prohibited from being distributed in Apache releases. Author: Josh Rosen Closes #15167 from JoshRosen/stop-publishing-kinesis-assembly. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d7ee1221 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d7ee1221 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d7ee1221 Branch: refs/heads/master Commit: d7ee12211a99efae6f7395e47089236838461d61 Parents: 248922f Author: Josh Rosen Authored: Wed Sep 21 11:38:10 2016 -0700 Committer: Josh Rosen Committed: Wed Sep 21 11:38:10 2016 -0700 -- external/kinesis-asl-assembly/pom.xml | 15 +++ 1 file changed, 15 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d7ee1221/external/kinesis-asl-assembly/pom.xml -- diff --git a/external/kinesis-asl-assembly/pom.xml b/external/kinesis-asl-assembly/pom.xml index df528b3..f7cb764 100644 --- a/external/kinesis-asl-assembly/pom.xml +++ b/external/kinesis-asl-assembly/pom.xml @@ -141,6 +141,21 @@ target/scala-${scala.binary.version}/classes target/scala-${scala.binary.version}/test-classes + + + org.apache.maven.plugins + maven-deploy-plugin + +true + + + + org.apache.maven.plugins + maven-install-plugin + +true + + org.apache.maven.plugins maven-shade-plugin - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published
Repository: spark Updated Branches: refs/heads/branch-2.0 45bccdd9c -> cd0bd89d7 [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published This patch updates the `kinesis-asl-assembly` build to prevent that module from being published as part of Maven releases and snapshot builds. The `kinesis-asl-assembly` includes classes from the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), both of which are licensed under the Amazon Software License and are therefore prohibited from being distributed in Apache releases. Author: Josh Rosen Closes #15167 from JoshRosen/stop-publishing-kinesis-assembly. (cherry picked from commit d7ee12211a99efae6f7395e47089236838461d61) Signed-off-by: Josh Rosen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cd0bd89d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cd0bd89d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cd0bd89d Branch: refs/heads/branch-2.0 Commit: cd0bd89d7852bab5adfee4b1b53c87efbf95176a Parents: 45bccdd Author: Josh Rosen Authored: Wed Sep 21 11:38:10 2016 -0700 Committer: Josh Rosen Committed: Wed Sep 21 11:38:55 2016 -0700 -- external/kinesis-asl-assembly/pom.xml | 15 +++ 1 file changed, 15 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cd0bd89d/external/kinesis-asl-assembly/pom.xml -- diff --git a/external/kinesis-asl-assembly/pom.xml b/external/kinesis-asl-assembly/pom.xml index 58c57c1..8fc6fd9 100644 --- a/external/kinesis-asl-assembly/pom.xml +++ b/external/kinesis-asl-assembly/pom.xml @@ -142,6 +142,21 @@ target/scala-${scala.binary.version}/classes target/scala-${scala.binary.version}/test-classes + + + org.apache.maven.plugins + maven-deploy-plugin + +true + + + + org.apache.maven.plugins + maven-install-plugin + +true + + org.apache.maven.plugins maven-shade-plugin - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-11918][ML] Better error from WLS for cases like singular input
Repository: spark Updated Branches: refs/heads/master d7ee12211 -> b4a4421b6 [SPARK-11918][ML] Better error from WLS for cases like singular input ## What changes were proposed in this pull request? Update error handling for Cholesky decomposition to provide a little more info when input is singular. ## How was this patch tested? New test case; jenkins tests. Author: Sean Owen Closes #15177 from srowen/SPARK-11918. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b4a4421b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b4a4421b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b4a4421b Branch: refs/heads/master Commit: b4a4421b610e776e5280fd5e7453f937f806cbd1 Parents: d7ee122 Author: Sean Owen Authored: Wed Sep 21 18:56:16 2016 + Committer: DB Tsai Committed: Wed Sep 21 18:56:16 2016 + -- .../mllib/linalg/CholeskyDecomposition.scala| 19 +++ .../ml/optim/WeightedLeastSquaresSuite.scala| 20 2 files changed, 35 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b4a4421b/mllib/src/main/scala/org/apache/spark/mllib/linalg/CholeskyDecomposition.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/CholeskyDecomposition.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/CholeskyDecomposition.scala index e449479..08f8f19 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/CholeskyDecomposition.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/CholeskyDecomposition.scala @@ -36,8 +36,7 @@ private[spark] object CholeskyDecomposition { val k = bx.length val info = new intW(0) lapack.dppsv("U", k, 1, A, bx, k, info) -val code = info.`val` -assert(code == 0, s"lapack.dppsv returned $code.") +checkReturnValue(info, "dppsv") bx } @@ -52,8 +51,20 @@ private[spark] object CholeskyDecomposition { def inverse(UAi: Array[Double], k: Int): Array[Double] = { val info = new intW(0) lapack.dpptri("U", k, UAi, info) -val code = info.`val` -assert(code == 0, s"lapack.dpptri returned $code.") +checkReturnValue(info, "dpptri") UAi } + + private def checkReturnValue(info: intW, method: String): Unit = { +info.`val` match { + case code if code < 0 => +throw new IllegalStateException(s"LAPACK.$method returned $code; arg ${-code} is illegal") + case code if code > 0 => +throw new IllegalArgumentException( + s"LAPACK.$method returned $code because A is not positive definite. Is A derived from " + + "a singular matrix (e.g. 
collinear column values)?") + case _ => // do nothing +} + } + } http://git-wip-us.apache.org/repos/asf/spark/blob/b4a4421b/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala index c8de796..2cb1af0 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala @@ -60,6 +60,26 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext ), 2) } + test("two collinear features result in error with no regularization") { +val singularInstances = sc.parallelize(Seq( + Instance(1.0, 1.0, Vectors.dense(1.0, 2.0)), + Instance(2.0, 1.0, Vectors.dense(2.0, 4.0)), + Instance(3.0, 1.0, Vectors.dense(3.0, 6.0)), + Instance(4.0, 1.0, Vectors.dense(4.0, 8.0)) +), 2) + +intercept[IllegalArgumentException] { + new WeightedLeastSquares( +false, regParam = 0.0, standardizeFeatures = false, +standardizeLabel = false).fit(singularInstances) +} + +// Should not throw an exception +new WeightedLeastSquares( + false, regParam = 1.0, standardizeFeatures = false, + standardizeLabel = false).fit(singularInstances) + } + test("WLS against lm") { /* R code: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
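The change above replaces bare `assert`s on the LAPACK return code with descriptive exceptions. A minimal, self-contained Scala sketch of the same mapping, using a plain `Int` in place of netlib-java's `intW` wrapper (that substitution is the only difference from the patched helper):

```scala
// Sketch of the return-code handling added to CholeskyDecomposition:
// info < 0 -> the call itself was malformed (programming error),
// info > 0 -> the matrix is not positive definite, e.g. it came from
//            singular/collinear input, so the caller gets an actionable message.
def checkLapackReturnValue(info: Int, method: String): Unit = info match {
  case code if code < 0 =>
    throw new IllegalStateException(s"LAPACK.$method returned $code; arg ${-code} is illegal")
  case code if code > 0 =>
    throw new IllegalArgumentException(
      s"LAPACK.$method returned $code because A is not positive definite. " +
        "Is A derived from a singular matrix (e.g. collinear column values)?")
  case _ => // 0 means success: nothing to do
}
```

As the new test exercises, an unregularized fit on collinear features now fails fast with `IllegalArgumentException`, while adding regularization (`regParam = 1.0`) keeps the system positive definite and succeeds.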
spark git commit: [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published
Repository: spark Updated Branches: refs/heads/branch-1.6 8f88412c3 -> ce0a222f5 [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published This patch updates the `kinesis-asl-assembly` build to prevent that module from being published as part of Maven releases and snapshot builds. The `kinesis-asl-assembly` includes classes from the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), both of which are licensed under the Amazon Software License and are therefore prohibited from being distributed in Apache releases. Author: Josh Rosen Closes #15167 from JoshRosen/stop-publishing-kinesis-assembly. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ce0a222f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ce0a222f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ce0a222f Branch: refs/heads/branch-1.6 Commit: ce0a222f56ffaf85273d2935b3e6d02aa9f6fa48 Parents: 8f88412 Author: Josh Rosen Authored: Wed Sep 21 11:38:10 2016 -0700 Committer: Josh Rosen Committed: Wed Sep 21 11:42:48 2016 -0700 -- extras/kinesis-asl-assembly/pom.xml | 15 +++ 1 file changed, 15 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ce0a222f/extras/kinesis-asl-assembly/pom.xml -- diff --git a/extras/kinesis-asl-assembly/pom.xml b/extras/kinesis-asl-assembly/pom.xml index 98d6d8d..6528e4e 100644 --- a/extras/kinesis-asl-assembly/pom.xml +++ b/extras/kinesis-asl-assembly/pom.xml @@ -132,6 +132,21 @@ target/scala-${scala.binary.version}/classes target/scala-${scala.binary.version}/test-classes + + + org.apache.maven.plugins + maven-deploy-plugin + +true + + + + org.apache.maven.plugins + maven-install-plugin + +true + + org.apache.maven.plugins maven-shade-plugin - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-4563][CORE] Allow driver to advertise a different network address.
Repository: spark Updated Branches: refs/heads/master b4a4421b6 -> 2cd1bfa4f [SPARK-4563][CORE] Allow driver to advertise a different network address. The goal of this feature is to allow the Spark driver to run in an isolated environment, such as a docker container, and be able to use the host's port forwarding mechanism to be able to accept connections from the outside world. The change is restricted to the driver: there is no support for achieving the same thing on executors (or the YARN AM for that matter). Those still need full access to the outside world so that, for example, connections can be made to an executor's block manager. The core of the change is simple: add a new configuration that tells what's the address the driver should bind to, which can be different than the address it advertises to executors (spark.driver.host). Everything else is plumbing the new configuration where it's needed. To use the feature, the host starting the container needs to set up the driver's port range to fall into a range that is being forwarded; this required the block manager port to need a special configuration just for the driver, which falls back to the existing spark.blockManager.port when not set. This way, users can modify the driver settings without affecting the executors; it would theoretically be nice to also have different retry counts for driver and executors, but given that docker (at least) allows forwarding port ranges, we can probably live without that for now. Because of the nature of the feature it's kinda hard to add unit tests; I just added a simple one to make sure the configuration works. This was tested with a docker image running spark-shell with the following command: docker blah blah blah \ -p 38000-38100:38000-38100 \ [image] \ spark-shell \ --num-executors 3 \ --conf spark.shuffle.service.enabled=false \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.driver.host=[host's address] \ --conf spark.driver.port=38000 \ --conf spark.driver.blockManager.port=38020 \ --conf spark.ui.port=38040 Running on YARN; verified the driver works, executors start up and listen on ephemeral ports (instead of using the driver's config), and that caching and shuffling (without the shuffle service) works. Clicked through the UI to make sure all pages (including executor thread dumps) worked. Also tested apps without docker, and ran unit tests. Author: Marcelo Vanzin Closes #15120 from vanzin/SPARK-4563. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2cd1bfa4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2cd1bfa4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2cd1bfa4 Branch: refs/heads/master Commit: 2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42 Parents: b4a4421 Author: Marcelo Vanzin Authored: Wed Sep 21 14:42:41 2016 -0700 Committer: Shixiong Zhu Committed: Wed Sep 21 14:42:41 2016 -0700 -- .../main/scala/org/apache/spark/SparkConf.scala | 2 ++ .../scala/org/apache/spark/SparkContext.scala | 5 ++-- .../main/scala/org/apache/spark/SparkEnv.scala | 27 +++- .../spark/internal/config/ConfigProvider.scala | 2 +- .../apache/spark/internal/config/package.scala | 20 +++ .../netty/NettyBlockTransferService.scala | 7 ++--- .../scala/org/apache/spark/rpc/RpcEnv.scala | 17 ++-- .../apache/spark/rpc/netty/NettyRpcEnv.scala| 9 --- .../main/scala/org/apache/spark/ui/WebUI.scala | 5 ++-- .../scala/org/apache/spark/util/Utils.scala | 6 ++--- .../netty/NettyBlockTransferSecuritySuite.scala | 6 +++-- .../netty/NettyBlockTransferServiceSuite.scala | 5 ++-- .../spark/rpc/netty/NettyRpcEnvSuite.scala | 16 ++-- .../storage/BlockManagerReplicationSuite.scala | 2 +- .../spark/storage/BlockManagerSuite.scala | 4 +-- docs/configuration.md | 23 - .../cluster/mesos/MesosSchedulerUtils.scala | 3 ++- ...esosCoarseGrainedSchedulerBackendSuite.scala | 5 ++-- .../mesos/MesosSchedulerUtilsSuite.scala| 3 ++- .../spark/streaming/CheckpointSuite.scala | 4 ++- .../streaming/ReceivedBlockHandlerSuite.scala | 2 +- 21 files changed, 133 insertions(+), 40 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2cd1bfa4/core/src/main/scala/org/apache/spark/SparkConf.scala -- diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala b/core/src/main/scala/org/apache/spark/SparkConf.scala index e85e5aa..51a699f4 100644 --- a/core/src/main/scala/org/apache/spark/SparkConf.scala +++ b/core/src/main/scala/org/apache/spark/SparkConf.scala @@ -422,6 +422,8 @@ class SparkConf(loadDefaults: Bool
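A hedged sketch of how the docker scenario above maps onto driver configuration. The port and host keys are taken verbatim from the commit message; the name of the new bind-address setting (`spark.driver.bindAddress` below) is an assumption for illustration, since the message only describes it as "the address the driver should bind to".

```scala
import org.apache.spark.SparkConf

// Driver running inside a container whose ports 38000-38100 are forwarded by the host.
// spark.driver.host is what executors connect back to (the host's address); the bind
// address (assumed key: spark.driver.bindAddress) is what the driver actually listens
// on inside the container.
val conf = new SparkConf()
  .setAppName("driver-behind-port-forwarding")
  .set("spark.driver.host", "host.example.com")    // hypothetical host address
  .set("spark.driver.bindAddress", "0.0.0.0")      // assumption: the new config added by this patch
  .set("spark.driver.port", "38000")
  .set("spark.driver.blockManager.port", "38020")  // driver-only; falls back to spark.blockManager.port
  .set("spark.ui.port", "38040")
```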
spark git commit: [SPARK-17623][CORE] Clarify type of TaskEndReason with a failed task.
Repository: spark Updated Branches: refs/heads/master 2cd1bfa4f -> 9fcf1c51d [SPARK-17623][CORE] Clarify type of TaskEndReason with a failed task. ## What changes were proposed in this pull request? In TaskResultGetter, enqueueFailedTask currently deserializes the result as a TaskEndReason. But the type is actually more specific, its a TaskFailedReason. This just leads to more blind casting later on – it would be more clear if the msg was cast to the right type immediately, so method parameter types could be tightened. ## How was this patch tested? Existing unit tests via jenkins. Note that the code was already performing a blind-cast to a TaskFailedReason before in any case, just in a different spot, so there shouldn't be any behavior change. Author: Imran Rashid Closes #15181 from squito/SPARK-17623. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9fcf1c51 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9fcf1c51 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9fcf1c51 Branch: refs/heads/master Commit: 9fcf1c51d518847eda7f5ea71337cfa7def3c45c Parents: 2cd1bfa Author: Imran Rashid Authored: Wed Sep 21 17:49:36 2016 -0400 Committer: Andrew Or Committed: Wed Sep 21 17:49:36 2016 -0400 -- .../apache/spark/executor/CommitDeniedException.scala | 4 ++-- .../main/scala/org/apache/spark/executor/Executor.scala | 4 ++-- .../org/apache/spark/scheduler/TaskResultGetter.scala | 4 ++-- .../org/apache/spark/scheduler/TaskSchedulerImpl.scala | 2 +- .../org/apache/spark/scheduler/TaskSetManager.scala | 12 +++- .../org/apache/spark/shuffle/FetchFailedException.scala | 4 ++-- .../scala/org/apache/spark/util/JsonProtocolSuite.scala | 2 +- 7 files changed, 13 insertions(+), 19 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9fcf1c51/core/src/main/scala/org/apache/spark/executor/CommitDeniedException.scala -- diff --git a/core/src/main/scala/org/apache/spark/executor/CommitDeniedException.scala b/core/src/main/scala/org/apache/spark/executor/CommitDeniedException.scala index 7d84889..326e042 100644 --- a/core/src/main/scala/org/apache/spark/executor/CommitDeniedException.scala +++ b/core/src/main/scala/org/apache/spark/executor/CommitDeniedException.scala @@ -17,7 +17,7 @@ package org.apache.spark.executor -import org.apache.spark.{TaskCommitDenied, TaskEndReason} +import org.apache.spark.{TaskCommitDenied, TaskFailedReason} /** * Exception thrown when a task attempts to commit output to HDFS but is denied by the driver. 
@@ -29,5 +29,5 @@ private[spark] class CommitDeniedException( attemptNumber: Int) extends Exception(msg) { - def toTaskEndReason: TaskEndReason = TaskCommitDenied(jobID, splitID, attemptNumber) + def toTaskFailedReason: TaskFailedReason = TaskCommitDenied(jobID, splitID, attemptNumber) } http://git-wip-us.apache.org/repos/asf/spark/blob/9fcf1c51/core/src/main/scala/org/apache/spark/executor/Executor.scala -- diff --git a/core/src/main/scala/org/apache/spark/executor/Executor.scala b/core/src/main/scala/org/apache/spark/executor/Executor.scala index fbf2b86..668ec41 100644 --- a/core/src/main/scala/org/apache/spark/executor/Executor.scala +++ b/core/src/main/scala/org/apache/spark/executor/Executor.scala @@ -355,7 +355,7 @@ private[spark] class Executor( } catch { case ffe: FetchFailedException => - val reason = ffe.toTaskEndReason + val reason = ffe.toTaskFailedReason setTaskFinishedAndClearInterruptStatus() execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason)) @@ -370,7 +370,7 @@ private[spark] class Executor( execBackend.statusUpdate(taskId, TaskState.KILLED, ser.serialize(TaskKilled)) case CausedBy(cDE: CommitDeniedException) => - val reason = cDE.toTaskEndReason + val reason = cDE.toTaskFailedReason setTaskFinishedAndClearInterruptStatus() execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason)) http://git-wip-us.apache.org/repos/asf/spark/blob/9fcf1c51/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala b/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala index 685ef55..1c3fcbd 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala @@ -118,14 +118,14 @@ private[spark] class TaskResultGetter(sparkEnv
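A toy Scala model of the type-tightening described above. The names mirror Spark's hierarchy, but the definitions here are simplified stand-ins: once the failure path declares `TaskFailedReason` instead of the broad `TaskEndReason`, the blind `asInstanceOf` disappears and parameter types can be tightened downstream.

```scala
// Simplified stand-ins for Spark's reason hierarchy, only to illustrate the typing change.
sealed trait TaskEndReason
case object Success extends TaskEndReason
sealed trait TaskFailedReason extends TaskEndReason { def toErrorString: String }
final case class ExceptionFailure(description: String) extends TaskFailedReason {
  def toErrorString: String = description
}

// Before: the deserialized value was typed as the supertype, forcing a blind cast later on.
def describeBefore(reason: TaskEndReason): String =
  reason.asInstanceOf[TaskFailedReason].toErrorString

// After: the failure-handling path takes the specific type, so no cast is needed and a
// successful TaskEndReason can never reach it by mistake.
def describeAfter(reason: TaskFailedReason): String = reason.toErrorString
```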
spark git commit: [SPARK-17512][CORE] Avoid formatting to python path for yarn and mesos cluster mode
Repository: spark Updated Branches: refs/heads/master 9fcf1c51d -> 8c3ee2bc4 [SPARK-17512][CORE] Avoid formatting to python path for yarn and mesos cluster mode ## What changes were proposed in this pull request? Yarn and mesos cluster mode support remote python path (HDFS/S3 scheme) by their own mechanism, it is not necessary to check and format the python when running on these modes. This is a potential regression compared to 1.6, so here propose to fix it. ## How was this patch tested? Unit test to verify SparkSubmit arguments, also with local cluster verification. Because of lack of `MiniDFSCluster` support in Spark unit test, there's no integration test added. Author: jerryshao Closes #15137 from jerryshao/SPARK-17512. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8c3ee2bc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8c3ee2bc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8c3ee2bc Branch: refs/heads/master Commit: 8c3ee2bc42e6320b9341cebdba51a00162c897ea Parents: 9fcf1c5 Author: jerryshao Authored: Wed Sep 21 17:57:21 2016 -0400 Committer: Andrew Or Committed: Wed Sep 21 17:57:21 2016 -0400 -- .../org/apache/spark/deploy/SparkSubmit.scala| 13 ++--- .../apache/spark/deploy/SparkSubmitSuite.scala | 19 +++ 2 files changed, 29 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8c3ee2bc/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala index 7b6d5a3..8061165 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala @@ -311,7 +311,7 @@ object SparkSubmit { // In Mesos cluster mode, non-local python files are automatically downloaded by Mesos. if (args.isPython && !isYarnCluster && !isMesosCluster) { if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) { -printErrorAndExit(s"Only local python files are supported: $args.primaryResource") +printErrorAndExit(s"Only local python files are supported: ${args.primaryResource}") } val nonLocalPyFiles = Utils.nonLocalPaths(args.pyFiles).mkString(",") if (nonLocalPyFiles.nonEmpty) { @@ -322,7 +322,7 @@ object SparkSubmit { // Require all R files to be local if (args.isR && !isYarnCluster) { if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) { -printErrorAndExit(s"Only local R files are supported: $args.primaryResource") +printErrorAndExit(s"Only local R files are supported: ${args.primaryResource}") } } @@ -633,7 +633,14 @@ object SparkSubmit { // explicitly sets `spark.submit.pyFiles` in his/her default properties file. sysProps.get("spark.submit.pyFiles").foreach { pyFiles => val resolvedPyFiles = Utils.resolveURIs(pyFiles) - val formattedPyFiles = PythonRunner.formatPaths(resolvedPyFiles).mkString(",") + val formattedPyFiles = if (!isYarnCluster && !isMesosCluster) { +PythonRunner.formatPaths(resolvedPyFiles).mkString(",") + } else { +// Ignoring formatting python path in yarn and mesos cluster mode, these two modes +// support dealing with remote python files, they could distribute and add python files +// locally. 
+resolvedPyFiles + } sysProps("spark.submit.pyFiles") = formattedPyFiles } http://git-wip-us.apache.org/repos/asf/spark/blob/8c3ee2bc/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala b/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala index 961ece3..31c8fb2 100644 --- a/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala @@ -582,6 +582,25 @@ class SparkSubmitSuite val sysProps3 = SparkSubmit.prepareSubmitEnvironment(appArgs3)._3 sysProps3("spark.submit.pyFiles") should be( PythonRunner.formatPaths(Utils.resolveURIs(pyFiles)).mkString(",")) + +// Test remote python files +val f4 = File.createTempFile("test-submit-remote-python-files", "", tmpDir) +val writer4 = new PrintWriter(f4) +val remotePyFiles = "hdfs:///tmp/file1.py,hdfs:///tmp/file2.py" +writer4.println("spark.submit.pyFiles " + remotePyFiles) +writer4.close() +val clArgs4 = Seq( + "--master", "yarn", + "--deploy-mode", "cluster", + "--properties-
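Besides the yarn/mesos change, the diff also fixes two error messages: with `$args.primaryResource` the interpolator stops at the identifier `args`, so the message printed the object's `toString` followed by a literal `.primaryResource`. A small self-contained illustration, where `Args` is a stand-in for `SparkSubmitArguments`:

```scala
// Stand-in for SparkSubmitArguments, only to show the interpolation pitfall.
final case class Args(primaryResource: String)
val args = Args("s3a://bucket/app.py")

// $args interpolates only `args`; ".primaryResource" stays literal text.
val wrong = s"Only local python files are supported: $args.primaryResource"
// "Only local python files are supported: Args(s3a://bucket/app.py).primaryResource"

// ${...} interpolates the full expression.
val right = s"Only local python files are supported: ${args.primaryResource}"
// "Only local python files are supported: s3a://bucket/app.py"
```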
spark git commit: [SPARK-17512][CORE] Avoid formatting to python path for yarn and mesos cluster mode
Repository: spark Updated Branches: refs/heads/branch-2.0 cd0bd89d7 -> 59e6ab11a [SPARK-17512][CORE] Avoid formatting to python path for yarn and mesos cluster mode ## What changes were proposed in this pull request? Yarn and mesos cluster mode support remote python path (HDFS/S3 scheme) by their own mechanism, it is not necessary to check and format the python when running on these modes. This is a potential regression compared to 1.6, so here propose to fix it. ## How was this patch tested? Unit test to verify SparkSubmit arguments, also with local cluster verification. Because of lack of `MiniDFSCluster` support in Spark unit test, there's no integration test added. Author: jerryshao Closes #15137 from jerryshao/SPARK-17512. (cherry picked from commit 8c3ee2bc42e6320b9341cebdba51a00162c897ea) Signed-off-by: Andrew Or Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/59e6ab11 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/59e6ab11 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/59e6ab11 Branch: refs/heads/branch-2.0 Commit: 59e6ab11a9e27d30ae3477fdc03337ff5f8ab4ec Parents: cd0bd89 Author: jerryshao Authored: Wed Sep 21 17:57:21 2016 -0400 Committer: Andrew Or Committed: Wed Sep 21 17:57:33 2016 -0400 -- .../org/apache/spark/deploy/SparkSubmit.scala| 13 ++--- .../apache/spark/deploy/SparkSubmitSuite.scala | 19 +++ 2 files changed, 29 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/59e6ab11/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala index 7b6d5a3..8061165 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala @@ -311,7 +311,7 @@ object SparkSubmit { // In Mesos cluster mode, non-local python files are automatically downloaded by Mesos. if (args.isPython && !isYarnCluster && !isMesosCluster) { if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) { -printErrorAndExit(s"Only local python files are supported: $args.primaryResource") +printErrorAndExit(s"Only local python files are supported: ${args.primaryResource}") } val nonLocalPyFiles = Utils.nonLocalPaths(args.pyFiles).mkString(",") if (nonLocalPyFiles.nonEmpty) { @@ -322,7 +322,7 @@ object SparkSubmit { // Require all R files to be local if (args.isR && !isYarnCluster) { if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) { -printErrorAndExit(s"Only local R files are supported: $args.primaryResource") +printErrorAndExit(s"Only local R files are supported: ${args.primaryResource}") } } @@ -633,7 +633,14 @@ object SparkSubmit { // explicitly sets `spark.submit.pyFiles` in his/her default properties file. sysProps.get("spark.submit.pyFiles").foreach { pyFiles => val resolvedPyFiles = Utils.resolveURIs(pyFiles) - val formattedPyFiles = PythonRunner.formatPaths(resolvedPyFiles).mkString(",") + val formattedPyFiles = if (!isYarnCluster && !isMesosCluster) { +PythonRunner.formatPaths(resolvedPyFiles).mkString(",") + } else { +// Ignoring formatting python path in yarn and mesos cluster mode, these two modes +// support dealing with remote python files, they could distribute and add python files +// locally. 
+resolvedPyFiles + } sysProps("spark.submit.pyFiles") = formattedPyFiles } http://git-wip-us.apache.org/repos/asf/spark/blob/59e6ab11/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala b/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala index b2bc886..54693c1 100644 --- a/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala @@ -577,6 +577,25 @@ class SparkSubmitSuite val sysProps3 = SparkSubmit.prepareSubmitEnvironment(appArgs3)._3 sysProps3("spark.submit.pyFiles") should be( PythonRunner.formatPaths(Utils.resolveURIs(pyFiles)).mkString(",")) + +// Test remote python files +val f4 = File.createTempFile("test-submit-remote-python-files", "", tmpDir) +val writer4 = new PrintWriter(f4) +val remotePyFiles = "hdfs:///tmp/file1.py,hdfs:///tmp/file2.py" +writer4.println("spark.submit.pyFiles " + remotePyFiles) +writer4.close() +
spark git commit: [SPARK-17569] Make StructuredStreaming FileStreamSource batch generation faster
Repository: spark Updated Branches: refs/heads/master 8c3ee2bc4 -> 7cbe21644 [SPARK-17569] Make StructuredStreaming FileStreamSource batch generation faster ## What changes were proposed in this pull request? While getting the batch for a `FileStreamSource` in StructuredStreaming, we know which files we must take specifically. We already have verified that they exist, and have committed them to a metadata log. When creating the FileSourceRelation however for an incremental execution, the code checks the existence of every single file once again! When you have 100,000s of files in a folder, creating the first batch takes 2 hours+ when working with S3! This PR disables that check ## How was this patch tested? Added a unit test to `FileStreamSource`. Author: Burak Yavuz Closes #15122 from brkyvz/SPARK-17569. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7cbe2164 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7cbe2164 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7cbe2164 Branch: refs/heads/master Commit: 7cbe2164499e83b6c009fdbab0fbfffe89a2ecc0 Parents: 8c3ee2b Author: Burak Yavuz Authored: Wed Sep 21 17:12:52 2016 -0700 Committer: Shixiong Zhu Committed: Wed Sep 21 17:12:52 2016 -0700 -- .../sql/execution/datasources/DataSource.scala | 10 +++- .../execution/streaming/FileStreamSource.scala | 3 +- .../streaming/FileStreamSourceSuite.scala | 53 +++- 3 files changed, 62 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7cbe2164/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala index 93154bd..413976a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala @@ -316,8 +316,14 @@ case class DataSource( /** * Create a resolved [[BaseRelation]] that can be used to read data from or write data into this * [[DataSource]] + * + * @param checkFilesExist Whether to confirm that the files exist when generating the + *non-streaming file based datasource. StructuredStreaming jobs already + *list file existence, and when generating incremental jobs, the batch + *is considered as a non-streaming file based data source. Since we know + *that files already exist, we don't need to check them again. */ - def resolveRelation(): BaseRelation = { + def resolveRelation(checkFilesExist: Boolean = true): BaseRelation = { val caseInsensitiveOptions = new CaseInsensitiveMap(options) val relation = (providingClass.newInstance(), userSpecifiedSchema) match { // TODO: Throw when too much is given. 
@@ -368,7 +374,7 @@ case class DataSource( throw new AnalysisException(s"Path does not exist: $qualified") } // Sufficient to check head of the globPath seq for non-glob scenario - if (!fs.exists(globPath.head)) { + if (checkFilesExist && !fs.exists(globPath.head)) { throw new AnalysisException(s"Path does not exist: ${globPath.head}") } globPath http://git-wip-us.apache.org/repos/asf/spark/blob/7cbe2164/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala index 0dc08b1..5ebc083 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala @@ -133,7 +133,8 @@ class FileStreamSource( userSpecifiedSchema = Some(schema), className = fileFormatClassName, options = sourceOptions.optionMapWithoutPath) -Dataset.ofRows(sparkSession, LogicalRelation(newDataSource.resolveRelation())) +Dataset.ofRows(sparkSession, LogicalRelation(newDataSource.resolveRelation( + checkFilesExist = false))) } /** http://git-wip-us.apache.org/repos/asf/spark/blob/7cbe2164/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceSuite.scala -- diff --git a
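A simplified sketch of the path-resolution guard that the new `checkFilesExist` flag controls (names abbreviated; the real logic sits in `DataSource.resolveRelation`, and `sys.error` stands in for `AnalysisException`). The streaming source passes `false` because every file in the batch was already verified when it was committed to the metadata log, so re-checking 100,000s of objects against S3 only adds latency.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve already-globbed paths, optionally verifying that the head exists.
// checkFilesExist = false skips the extra file-system round trip for callers
// (like FileStreamSource) that have already confirmed the files exist.
def resolveGlobbedPaths(
    globPaths: Seq[Path],
    hadoopConf: Configuration,
    checkFilesExist: Boolean): Seq[Path] = {
  if (globPaths.isEmpty) {
    sys.error("Path does not exist")               // stand-in for AnalysisException
  }
  if (checkFilesExist) {
    val fs: FileSystem = globPaths.head.getFileSystem(hadoopConf)
    // Sufficient to check the head of the glob result for the non-glob case.
    if (!fs.exists(globPaths.head)) {
      sys.error(s"Path does not exist: ${globPaths.head}")
    }
  }
  globPaths
}
```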
spark git commit: [SPARK-17577][SPARKR][CORE] SparkR support add files to Spark job and get by executors
Repository: spark Updated Branches: refs/heads/master 7cbe21644 -> c133907c5 [SPARK-17577][SPARKR][CORE] SparkR support add files to Spark job and get by executors ## What changes were proposed in this pull request? Scala/Python users can add files to Spark job by submit options ```--files``` or ```SparkContext.addFile()```. Meanwhile, users can get the added file by ```SparkFiles.get(filename)```. We should also support this function for SparkR users, since they also have the requirements for some shared dependency files. For example, SparkR users can download third party R packages to driver firstly, add these files to the Spark job as dependency by this API and then each executor can install these packages by ```install.packages```. ## How was this patch tested? Add unit test. Author: Yanbo Liang Closes #15131 from yanboliang/spark-17577. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c133907c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c133907c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c133907c Branch: refs/heads/master Commit: c133907c5d9a6e6411b896b5e0cff48b2beff09f Parents: 7cbe216 Author: Yanbo Liang Authored: Wed Sep 21 20:08:28 2016 -0700 Committer: Yanbo Liang Committed: Wed Sep 21 20:08:28 2016 -0700 -- R/pkg/NAMESPACE | 3 ++ R/pkg/R/context.R | 48 R/pkg/inst/tests/testthat/test_context.R| 13 ++ .../scala/org/apache/spark/SparkContext.scala | 6 +-- 4 files changed, 67 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c133907c/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index a5e9cbd..267a38c 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -336,6 +336,9 @@ export("as.DataFrame", "read.parquet", "read.text", "spark.lapply", + "spark.addFile", + "spark.getSparkFilesRootDirectory", + "spark.getSparkFiles", "sql", "str", "tableToDF", http://git-wip-us.apache.org/repos/asf/spark/blob/c133907c/R/pkg/R/context.R -- diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index 13ade49..4793578 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -225,6 +225,54 @@ setCheckpointDir <- function(sc, dirName) { invisible(callJMethod(sc, "setCheckpointDir", suppressWarnings(normalizePath(dirName } +#' Add a file or directory to be downloaded with this Spark job on every node. +#' +#' The path passed can be either a local file, a file in HDFS (or other Hadoop-supported +#' filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, +#' use spark.getSparkFiles(fileName) to find its download location. +#' +#' @rdname spark.addFile +#' @param path The path of the file to be added +#' @export +#' @examples +#'\dontrun{ +#' spark.addFile("~/myfile") +#'} +#' @note spark.addFile since 2.1.0 +spark.addFile <- function(path) { + sc <- getSparkContext() + invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path +} + +#' Get the root directory that contains files added through spark.addFile. +#' +#' @rdname spark.getSparkFilesRootDirectory +#' @return the root directory that contains files added through spark.addFile +#' @export +#' @examples +#'\dontrun{ +#' spark.getSparkFilesRootDirectory() +#'} +#' @note spark.getSparkFilesRootDirectory since 2.1.0 +spark.getSparkFilesRootDirectory <- function() { + callJStatic("org.apache.spark.SparkFiles", "getRootDirectory") +} + +#' Get the absolute path of a file added through spark.addFile. 
+#' +#' @rdname spark.getSparkFiles +#' @param fileName The name of the file added through spark.addFile +#' @return the absolute path of a file added through spark.addFile. +#' @export +#' @examples +#'\dontrun{ +#' spark.getSparkFiles("myfile") +#'} +#' @note spark.getSparkFiles since 2.1.0 +spark.getSparkFiles <- function(fileName) { + callJStatic("org.apache.spark.SparkFiles", "get", as.character(fileName)) +} + #' Run a function over a list of elements, distributing the computations with Spark #' #' Run a function over a list of elements, distributing the computations with Spark. Applies a http://git-wip-us.apache.org/repos/asf/spark/blob/c133907c/R/pkg/inst/tests/testthat/test_context.R -- diff --git a/R/pkg/inst/tests/testthat/test_context.R b/R/pkg/inst/tests/testthat/test_context.R index 1ab7f31..0495418 100644 --- a/R/pkg/inst/tests/testthat/test_context.R +++ b/R/pkg/inst/tests/testthat/test_context.R @@ -166,3 +166,16 @@ test_that("spark
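For reference, the JVM-side API that the new R wrappers delegate to: `spark.addFile`, `spark.getSparkFiles`, and `spark.getSparkFilesRootDirectory` call `SparkContext.addFile` and the `SparkFiles` helpers through `callJMethod`/`callJStatic`. A minimal Scala usage sketch; the file path is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(new SparkConf().setAppName("add-file-demo").setMaster("local[2]"))

// Ship a dependency file to every node with the job.
sc.addFile("/tmp/lookup.csv")                      // hypothetical local path

// On executors, resolve the downloaded copy by file name.
val paths = sc.parallelize(1 to 2).map(_ => SparkFiles.get("lookup.csv")).collect()
paths.foreach(println)

// Root directory that holds files added through addFile
// (what spark.getSparkFilesRootDirectory exposes to R).
println(SparkFiles.getRootDirectory())

sc.stop()
```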
spark git commit: [SPARK-17315][FOLLOW-UP][SPARKR][ML] Fix print of Kolmogorov-Smirnov test summary
Repository: spark Updated Branches: refs/heads/master c133907c5 -> 6902edab7 [SPARK-17315][FOLLOW-UP][SPARKR][ML] Fix print of Kolmogorov-Smirnov test summary ## What changes were proposed in this pull request? #14881 added Kolmogorov-Smirnov Test wrapper to SparkR. I found that ```print.summary.KSTest``` was implemented inappropriately and result in no effect. Running the following code for KSTest: ```Scala data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25, -1, -0.5)) df <- createDataFrame(data) testResult <- spark.kstest(df, "test", "norm") summary(testResult) ``` Before this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/18615016/b9a2823a-7d4f-11e6-934b-128beade355e.png) After this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/18615014/aafe2798-7d4f-11e6-8b99-c705bb9fe8f2.png) The new implementation is similar with [```print.summary.GeneralizedLinearRegressionModel```](https://github.com/apache/spark/blob/master/R/pkg/R/mllib.R#L284) of SparkR and [```print.summary.glm```](https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R) of native R. BTW, I removed the comparison of ```print.summary.KSTest``` in unit test, since it's only wrappers of the summary output which has been checked. Another reason is that these comparison will output summary information to the test console, it will make the test output in a mess. ## How was this patch tested? Existing test. Author: Yanbo Liang Closes #15139 from yanboliang/spark-17315. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6902edab Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6902edab Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6902edab Branch: refs/heads/master Commit: 6902edab7e80e96e3f57cf80f26cefb209d4d63c Parents: c133907 Author: Yanbo Liang Authored: Wed Sep 21 20:14:18 2016 -0700 Committer: Yanbo Liang Committed: Wed Sep 21 20:14:18 2016 -0700 -- R/pkg/R/mllib.R| 16 +--- R/pkg/inst/tests/testthat/test_mllib.R | 16 ++-- 2 files changed, 11 insertions(+), 21 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6902edab/R/pkg/R/mllib.R -- diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R index 234b208..98db367 100644 --- a/R/pkg/R/mllib.R +++ b/R/pkg/R/mllib.R @@ -1398,20 +1398,22 @@ setMethod("summary", signature(object = "KSTest"), distParams <- unlist(callJMethod(jobj, "distParams")) degreesOfFreedom <- callJMethod(jobj, "degreesOfFreedom") -list(p.value = pValue, statistic = statistic, nullHypothesis = nullHypothesis, - nullHypothesis.name = distName, nullHypothesis.parameters = distParams, - degreesOfFreedom = degreesOfFreedom) +ans <- list(p.value = pValue, statistic = statistic, nullHypothesis = nullHypothesis, +nullHypothesis.name = distName, nullHypothesis.parameters = distParams, +degreesOfFreedom = degreesOfFreedom, jobj = jobj) +class(ans) <- "summary.KSTest" +ans }) # Prints the summary of KSTest #' @rdname spark.kstest -#' @param x test result object of KSTest by \code{spark.kstest}. +#' @param x summary object of KSTest returned by \code{summary}. #' @export #' @note print.summary.KSTest since 2.1.0 print.summary.KSTest <- function(x, ...) 
{ - jobj <- x@jobj + jobj <- x$jobj summaryStr <- callJMethod(jobj, "summary") - cat(summaryStr) - invisible(summaryStr) + cat(summaryStr, "\n") + invisible(x) } http://git-wip-us.apache.org/repos/asf/spark/blob/6902edab/R/pkg/inst/tests/testthat/test_mllib.R -- diff --git a/R/pkg/inst/tests/testthat/test_mllib.R b/R/pkg/inst/tests/testthat/test_mllib.R index 5b1404c..24c40a8 100644 --- a/R/pkg/inst/tests/testthat/test_mllib.R +++ b/R/pkg/inst/tests/testthat/test_mllib.R @@ -760,13 +760,7 @@ test_that("spark.kstest", { expect_equal(stats$p.value, rStats$p.value, tolerance = 1e-4) expect_equal(stats$statistic, unname(rStats$statistic), tolerance = 1e-4) - - printStr <- print.summary.KSTest(testResult) - expect_match(printStr, paste0("Kolmogorov-Smirnov test summary:\\n", -"degrees of freedom = 0 \\n", -"statistic = 0.38208[0-9]* \\n", -"pValue = 0.19849[0-9]* \\n", -".*"), perl = TRUE) + expect_match(capture.output(stats)[1], "Kolmogorov-Smirnov test summary:") testResult <- spark.kstest(df, "test", "norm", -0.5) stats <- summary(testResult) @@ -775,13 +769,7 @@ test_that("spark.kstes
spark git commit: [SPARK-17627] Mark Streaming Providers Experimental
Repository: spark Updated Branches: refs/heads/branch-2.0 59e6ab11a -> 966abd6af [SPARK-17627] Mark Streaming Providers Experimental All of structured streaming is experimental in its first release. We missed the annotation on two of the APIs. Author: Michael Armbrust Closes #15188 from marmbrus/experimentalApi. (cherry picked from commit 3497ebe511fee67e66387e9e737c843a2939ce45) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/966abd6a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/966abd6a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/966abd6a Branch: refs/heads/branch-2.0 Commit: 966abd6af04b8e7b5f6446cba96f1825ca2bfcfa Parents: 59e6ab1 Author: Michael Armbrust Authored: Wed Sep 21 20:59:46 2016 -0700 Committer: Reynold Xin Committed: Wed Sep 21 20:59:52 2016 -0700 -- .../src/main/scala/org/apache/spark/sql/sources/interfaces.scala | 4 1 file changed, 4 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/966abd6a/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala b/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala index d2077a0..b84953d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala @@ -112,8 +112,10 @@ trait SchemaRelationProvider { } /** + * ::Experimental:: * Implemented by objects that can produce a streaming [[Source]] for a specific format or system. */ +@Experimental trait StreamSourceProvider { /** Returns the name and schema of the source that can be used to continually read data. */ @@ -132,8 +134,10 @@ trait StreamSourceProvider { } /** + * ::Experimental:: * Implemented by objects that can produce a streaming [[Sink]] for a specific format or system. */ +@Experimental trait StreamSinkProvider { def createSink( sqlContext: SQLContext, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-17627] Mark Streaming Providers Experimental
Repository: spark Updated Branches: refs/heads/master 6902edab7 -> 3497ebe51 [SPARK-17627] Mark Streaming Providers Experimental All of structured streaming is experimental in its first release. We missed the annotation on two of the APIs. Author: Michael Armbrust Closes #15188 from marmbrus/experimentalApi. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3497ebe5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3497ebe5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3497ebe5 Branch: refs/heads/master Commit: 3497ebe511fee67e66387e9e737c843a2939ce45 Parents: 6902eda Author: Michael Armbrust Authored: Wed Sep 21 20:59:46 2016 -0700 Committer: Reynold Xin Committed: Wed Sep 21 20:59:46 2016 -0700 -- .../src/main/scala/org/apache/spark/sql/sources/interfaces.scala | 4 1 file changed, 4 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3497ebe5/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala b/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala index a16d7ed..6484c78 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala @@ -112,8 +112,10 @@ trait SchemaRelationProvider { } /** + * ::Experimental:: * Implemented by objects that can produce a streaming [[Source]] for a specific format or system. */ +@Experimental trait StreamSourceProvider { /** Returns the name and schema of the source that can be used to continually read data. */ @@ -132,8 +134,10 @@ trait StreamSourceProvider { } /** + * ::Experimental:: * Implemented by objects that can produce a streaming [[Sink]] for a specific format or system. */ +@Experimental trait StreamSinkProvider { def createSink( sqlContext: SQLContext, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-17494][SQL] changePrecision() on compact decimal should respect rounding mode
Repository: spark Updated Branches: refs/heads/master 3497ebe51 -> 8bde03bf9 [SPARK-17494][SQL] changePrecision() on compact decimal should respect rounding mode ## What changes were proposed in this pull request? Floor()/Ceil() of decimal is implemented using changePrecision() by passing a rounding mode, but the rounding mode is not respected when the decimal is in compact mode (could fit within a Long). This Update the changePrecision() to respect rounding mode, which could be ROUND_FLOOR, ROUND_CEIL, ROUND_HALF_UP, ROUND_HALF_EVEN. ## How was this patch tested? Added regression tests. Author: Davies Liu Closes #15154 from davies/decimal_round. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8bde03bf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8bde03bf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8bde03bf Branch: refs/heads/master Commit: 8bde03bf9a0896ea59ceaa699df7700351a130fb Parents: 3497ebe Author: Davies Liu Authored: Wed Sep 21 21:02:30 2016 -0700 Committer: Reynold Xin Committed: Wed Sep 21 21:02:30 2016 -0700 -- .../org/apache/spark/sql/types/Decimal.scala| 28 +--- .../apache/spark/sql/types/DecimalSuite.scala | 15 +++ 2 files changed, 39 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8bde03bf/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala index cc8175c..7085905 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala @@ -242,10 +242,30 @@ final class Decimal extends Ordered[Decimal] with Serializable { if (scale < _scale) { // Easier case: we just need to divide our scale down val diff = _scale - scale -val droppedDigits = longVal % POW_10(diff) -longVal /= POW_10(diff) -if (math.abs(droppedDigits) * 2 >= POW_10(diff)) { - longVal += (if (longVal < 0) -1L else 1L) +val pow10diff = POW_10(diff) +// % and / always round to 0 +val droppedDigits = longVal % pow10diff +longVal /= pow10diff +roundMode match { + case ROUND_FLOOR => +if (droppedDigits < 0) { + longVal += -1L +} + case ROUND_CEILING => +if (droppedDigits > 0) { + longVal += 1L +} + case ROUND_HALF_UP => +if (math.abs(droppedDigits) * 2 >= pow10diff) { + longVal += (if (droppedDigits < 0) -1L else 1L) +} + case ROUND_HALF_EVEN => +val doubled = math.abs(droppedDigits) * 2 +if (doubled > pow10diff || doubled == pow10diff && longVal % 2 != 0) { + longVal += (if (droppedDigits < 0) -1L else 1L) +} + case _ => +sys.error(s"Not supported rounding mode: $roundMode") } } else if (scale > _scale) { // We might be able to multiply longVal by a power of 10 and not overflow, but if not, http://git-wip-us.apache.org/repos/asf/spark/blob/8bde03bf/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala index a10c0e3..52d0692 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala @@ -20,6 +20,7 @@ package org.apache.spark.sql.types import org.scalatest.PrivateMethodTester import org.apache.spark.SparkFunSuite +import 
org.apache.spark.sql.types.Decimal._ class DecimalSuite extends SparkFunSuite with PrivateMethodTester { /** Check that a Decimal has the given string representation, precision and scale */ @@ -191,4 +192,18 @@ class DecimalSuite extends SparkFunSuite with PrivateMethodTester { assert(new Decimal().set(100L, 10, 0).toUnscaledLong === 100L) assert(Decimal(Long.MaxValue, 100, 0).toUnscaledLong === Long.MaxValue) } + + test("changePrecision() on compact decimal should respect rounding mode") { +Seq(ROUND_FLOOR, ROUND_CEILING, ROUND_HALF_UP, ROUND_HALF_EVEN).foreach { mode => + Seq("0.4", "0.5", "0.6", "1.0", "1.1", "1.6", "2.5", "5.5").foreach { n => +Seq("", "-").foreach { sign => + val bd = BigDecimal(sign + n) + val unscale
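A standalone sketch of the compact-path arithmetic shown in the diff: the unscaled `Long` is divided by `10^diff`, and the sign and magnitude of the dropped digits decide the adjustment for each rounding mode. It uses `scala.math.BigDecimal.RoundingMode` directly in place of the `ROUND_*` constants the patch and its test use from `Decimal`; this is an illustration of the logic, not Spark's public API.

```scala
import scala.math.BigDecimal.RoundingMode

// Reduce the scale of a compact (Long-backed) decimal by `diff` digits, honouring the
// rounding mode — the same dropped-digits logic the patch adds to Decimal.changePrecision.
def reduceScale(unscaled: Long, diff: Int, mode: RoundingMode.Value): Long = {
  require(diff >= 0 && diff <= 18, "sketch only handles powers of ten that fit in a Long")
  val pow10diff = (1 to diff).foldLeft(1L)((p, _) => p * 10L)  // like Decimal's POW_10 table
  val droppedDigits = unscaled % pow10diff                      // % and / both round toward zero
  var longVal = unscaled / pow10diff
  mode match {
    case RoundingMode.FLOOR =>
      if (droppedDigits < 0) longVal -= 1L
    case RoundingMode.CEILING =>
      if (droppedDigits > 0) longVal += 1L
    case RoundingMode.HALF_UP =>
      if (math.abs(droppedDigits) * 2 >= pow10diff) longVal += (if (droppedDigits < 0) -1L else 1L)
    case RoundingMode.HALF_EVEN =>
      val doubled = math.abs(droppedDigits) * 2
      if (doubled > pow10diff || (doubled == pow10diff && longVal % 2 != 0)) {
        longVal += (if (droppedDigits < 0) -1L else 1L)
      }
    case other =>
      sys.error(s"Not supported rounding mode: $other")
  }
  longVal
}

// Example: 2.5 stored compactly as unscaled 25 with scale 1, reduced to scale 0:
// FLOOR -> 2, CEILING -> 3, HALF_UP -> 3, HALF_EVEN -> 2 (ties go to the even neighbour).
```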
spark git commit: [SPARK-17494][SQL] changePrecision() on compact decimal should respect rounding mode
Repository: spark Updated Branches: refs/heads/branch-2.0 966abd6af -> ec377e773 [SPARK-17494][SQL] changePrecision() on compact decimal should respect rounding mode ## What changes were proposed in this pull request? Floor()/Ceil() of decimal is implemented using changePrecision() by passing a rounding mode, but the rounding mode is not respected when the decimal is in compact mode (could fit within a Long). This Update the changePrecision() to respect rounding mode, which could be ROUND_FLOOR, ROUND_CEIL, ROUND_HALF_UP, ROUND_HALF_EVEN. ## How was this patch tested? Added regression tests. Author: Davies Liu Closes #15154 from davies/decimal_round. (cherry picked from commit 8bde03bf9a0896ea59ceaa699df7700351a130fb) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ec377e77 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ec377e77 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ec377e77 Branch: refs/heads/branch-2.0 Commit: ec377e77307b477d20a642edcd5ad5e26b989de6 Parents: 966abd6 Author: Davies Liu Authored: Wed Sep 21 21:02:30 2016 -0700 Committer: Reynold Xin Committed: Wed Sep 21 21:02:42 2016 -0700 -- .../org/apache/spark/sql/types/Decimal.scala| 28 +--- .../apache/spark/sql/types/DecimalSuite.scala | 15 +++ 2 files changed, 39 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ec377e77/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala index cc8175c..7085905 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala @@ -242,10 +242,30 @@ final class Decimal extends Ordered[Decimal] with Serializable { if (scale < _scale) { // Easier case: we just need to divide our scale down val diff = _scale - scale -val droppedDigits = longVal % POW_10(diff) -longVal /= POW_10(diff) -if (math.abs(droppedDigits) * 2 >= POW_10(diff)) { - longVal += (if (longVal < 0) -1L else 1L) +val pow10diff = POW_10(diff) +// % and / always round to 0 +val droppedDigits = longVal % pow10diff +longVal /= pow10diff +roundMode match { + case ROUND_FLOOR => +if (droppedDigits < 0) { + longVal += -1L +} + case ROUND_CEILING => +if (droppedDigits > 0) { + longVal += 1L +} + case ROUND_HALF_UP => +if (math.abs(droppedDigits) * 2 >= pow10diff) { + longVal += (if (droppedDigits < 0) -1L else 1L) +} + case ROUND_HALF_EVEN => +val doubled = math.abs(droppedDigits) * 2 +if (doubled > pow10diff || doubled == pow10diff && longVal % 2 != 0) { + longVal += (if (droppedDigits < 0) -1L else 1L) +} + case _ => +sys.error(s"Not supported rounding mode: $roundMode") } } else if (scale > _scale) { // We might be able to multiply longVal by a power of 10 and not overflow, but if not, http://git-wip-us.apache.org/repos/asf/spark/blob/ec377e77/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala index e1675c9..4cf329d 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala @@ -22,6 +22,7 @@ import scala.language.postfixOps import 
org.scalatest.PrivateMethodTester import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.types.Decimal._ class DecimalSuite extends SparkFunSuite with PrivateMethodTester { /** Check that a Decimal has the given string representation, precision and scale */ @@ -193,4 +194,18 @@ class DecimalSuite extends SparkFunSuite with PrivateMethodTester { assert(new Decimal().set(100L, 10, 0).toUnscaledLong === 100L) assert(Decimal(Long.MaxValue, 100, 0).toUnscaledLong === Long.MaxValue) } + + test("changePrecision() on compact decimal should respect rounding mode") { +Seq(ROUND_FLOOR, ROUND_CEILING, ROUND_HALF_UP, ROUND_HALF_EVEN).foreach { mode => + Seq("0.4", "0.5", "0.6", "1.0", "1.1", "1.6", "2.5", "5.5").foreach { n =>
spark git commit: Bump doc version for release 2.0.1.
Repository: spark Updated Branches: refs/heads/branch-2.0 ec377e773 -> 053b20a79 Bump doc version for release 2.0.1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/053b20a7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/053b20a7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/053b20a7 Branch: refs/heads/branch-2.0 Commit: 053b20a79c1824917c17405f30a7b91472311abe Parents: ec377e7 Author: Reynold Xin Authored: Wed Sep 21 21:06:47 2016 -0700 Committer: Reynold Xin Committed: Wed Sep 21 21:06:47 2016 -0700 -- docs/_config.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/053b20a7/docs/_config.yml -- diff --git a/docs/_config.yml b/docs/_config.yml index 3951cad..75c89bd 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -14,8 +14,8 @@ include: # These allow the documentation to be updated with newer releases # of Spark, Scala, and Mesos. -SPARK_VERSION: 2.0.0 -SPARK_VERSION_SHORT: 2.0.0 +SPARK_VERSION: 2.0.1 +SPARK_VERSION_SHORT: 2.0.1 SCALA_BINARY_VERSION: "2.11" SCALA_VERSION: "2.11.7" MESOS_VERSION: 0.21.0 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[1/2] spark git commit: Preparing Spark release v2.0.1-rc1
Repository: spark Updated Branches: refs/heads/branch-2.0 053b20a79 -> e8b26be9b Preparing Spark release v2.0.1-rc1 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/00f2e28e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/00f2e28e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/00f2e28e Branch: refs/heads/branch-2.0 Commit: 00f2e28edd5a74f75e8b4c58894eeb3a394649d7 Parents: 053b20a Author: Patrick Wendell Authored: Wed Sep 21 21:09:08 2016 -0700 Committer: Patrick Wendell Committed: Wed Sep 21 21:09:08 2016 -0700 -- assembly/pom.xml | 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/java8-tests/pom.xml | 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- repl/pom.xml | 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- yarn/pom.xml | 2 +- 34 files changed, 34 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/00f2e28e/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 507ddc7..6db3a59 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.1 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/00f2e28e/common/network-common/pom.xml -- diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index e170b9b..269b845 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.1 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/00f2e28e/common/network-shuffle/pom.xml -- diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index 8b832cf..20cf29e 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.1 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/00f2e28e/common/network-yarn/pom.xml -- diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index 07d9f1c..25cc328 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.1 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/00f2e28e/common/sketch/pom.xml -- diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index 5e02efd..37a5d09 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.1 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/00f2e28e/common/tags/pom.xml -- diff --git a/common/tags/pom.xml b/common/tags/pom.xml index e7fc6a2..ab287f3 100644 --- 
a/common/tags/pom.xml +++ b/common/tags/pom.xml @@ -22,7
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v2.0.1-rc1 [created] 00f2e28ed
[2/2] spark git commit: Preparing development version 2.0.2-SNAPSHOT
Preparing development version 2.0.2-SNAPSHOT Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e8b26be9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e8b26be9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e8b26be9 Branch: refs/heads/branch-2.0 Commit: e8b26be9bf2b7c46f4076b4d0597ea8451d3 Parents: 00f2e28 Author: Patrick Wendell Authored: Wed Sep 21 21:09:19 2016 -0700 Committer: Patrick Wendell Committed: Wed Sep 21 21:09:19 2016 -0700 -- assembly/pom.xml | 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/java8-tests/pom.xml | 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- repl/pom.xml | 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- yarn/pom.xml | 2 +- 34 files changed, 34 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e8b26be9/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 6db3a59..ca6daa2 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.0.1 +2.0.2-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/e8b26be9/common/network-common/pom.xml -- diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index 269b845..c727f54 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1 +2.0.2-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/e8b26be9/common/network-shuffle/pom.xml -- diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index 20cf29e..e335a89 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1 +2.0.2-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/e8b26be9/common/network-yarn/pom.xml -- diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index 25cc328..8e64f56 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1 +2.0.2-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/e8b26be9/common/sketch/pom.xml -- diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index 37a5d09..94c75d6 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1 +2.0.2-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/e8b26be9/common/tags/pom.xml -- diff --git a/common/tags/pom.xml b/common/tags/pom.xml index ab287f3..6ff14d2 100644 --- a/common/tags/pom.xml +++ b/common/tags/pom.xml @@ -22,7 +22,7 @@ 
org.apache.spark spark-parent_2.11 -2.0.1 +2
spark git commit: [SPARK-17609][SQL] SessionCatalog.tableExists should not check temp view
Repository: spark Updated Branches: refs/heads/master 8bde03bf9 -> b50b34f56 [SPARK-17609][SQL] SessionCatalog.tableExists should not check temp view ## What changes were proposed in this pull request? After #15054 , there is no place in Spark SQL that need `SessionCatalog.tableExists` to check temp views, so this PR makes `SessionCatalog.tableExists` only check permanent table/view and removes some hacks. This PR also improves the `getTempViewOrPermanentTableMetadata` that is introduced in #15054 , to make the code simpler. ## How was this patch tested? existing tests Author: Wenchen Fan Closes #15160 from cloud-fan/exists. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b50b34f5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b50b34f5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b50b34f5 Branch: refs/heads/master Commit: b50b34f5611a1f182ba9b6eaf86c666bbd9f9eb0 Parents: 8bde03b Author: Wenchen Fan Authored: Thu Sep 22 12:52:09 2016 +0800 Committer: Wenchen Fan Committed: Thu Sep 22 12:52:09 2016 +0800 -- .../sql/catalyst/catalog/SessionCatalog.scala | 70 ++-- .../catalyst/catalog/SessionCatalogSuite.scala | 30 + .../org/apache/spark/sql/DataFrameWriter.scala | 9 +-- .../command/createDataSourceTables.scala| 15 ++--- .../spark/sql/execution/command/ddl.scala | 43 +--- .../spark/sql/execution/command/tables.scala| 17 + .../apache/spark/sql/internal/CatalogImpl.scala | 6 +- .../sql/hive/MetastoreDataSourcesSuite.scala| 2 +- .../spark/sql/hive/execution/HiveDDLSuite.scala | 4 +- 9 files changed, 81 insertions(+), 115 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b50b34f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala index ef29c75..8c01c7a 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala @@ -246,6 +246,16 @@ class SessionCatalog( } /** + * Return whether a table/view with the specified name exists. If no database is specified, check + * with current database. + */ + def tableExists(name: TableIdentifier): Boolean = synchronized { +val db = formatDatabaseName(name.database.getOrElse(currentDb)) +val table = formatTableName(name.table) +externalCatalog.tableExists(db, table) + } + + /** * Retrieve the metadata of an existing permanent table/view. If no database is specified, * assume the table/view is in the current database. If the specified table/view is not found * in the database then a [[NoSuchTableException]] is thrown. @@ -271,24 +281,6 @@ class SessionCatalog( } /** - * Retrieve the metadata of an existing temporary view or permanent table/view. - * If the temporary view does not exist, tries to get the metadata an existing permanent - * table/view. If no database is specified, assume the table/view is in the current database. - * If the specified table/view is not found in the database then a [[NoSuchTableException]] is - * thrown. 
- */ - def getTempViewOrPermanentTableMetadata(name: String): CatalogTable = synchronized { -val table = formatTableName(name) -getTempView(table).map { plan => - CatalogTable( -identifier = TableIdentifier(table), -tableType = CatalogTableType.VIEW, -storage = CatalogStorageFormat.empty, -schema = plan.output.toStructType) -}.getOrElse(getTableMetadata(TableIdentifier(name))) - } - - /** * Load files stored in given path into an existing metastore table. * If no database is specified, assume the table is in the current database. * If the specified table is not found in the database then a [[NoSuchTableException]] is thrown. @@ -369,6 +361,30 @@ class SessionCatalog( // - /** + * Retrieve the metadata of an existing temporary view or permanent table/view. + * + * If a database is specified in `name`, this will return the metadata of table/view in that + * database. + * If no database is specified, this will first attempt to get the metadata of a temporary view + * with the same name, then, if that does not exist, return the metadata of table/view in the + * current database. + */ + def getTempViewOrPermanentTableMetadata(name: TableIdentifier): CatalogTable = synchronized {
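A self-contained toy sketch of the two lookup rules this change settles on (invented names, not the SessionCatalog API): `tableExists` consults only the external catalog, while the temp-view-or-table lookup prefers a temporary view only for database-unqualified names.

```scala
import scala.collection.mutable

final case class ToyMeta(name: String, isTempView: Boolean)

final class ToyCatalog(currentDb: String) {
  private val tempViews = mutable.Map.empty[String, ToyMeta]
  private val tables    = mutable.Map.empty[(String, String), ToyMeta]

  def createTempView(name: String): Unit =
    tempViews(name) = ToyMeta(name, isTempView = true)

  def createTable(db: String, name: String): Unit =
    tables((db, name)) = ToyMeta(name, isTempView = false)

  // Analogue of the new tableExists: temp views are deliberately ignored.
  def tableExists(db: Option[String], name: String): Boolean =
    tables.contains((db.getOrElse(currentDb), name))

  // Analogue of getTempViewOrPermanentTableMetadata: a qualified name never hits a temp view.
  def tempViewOrTableMetadata(db: Option[String], name: String): Option[ToyMeta] = db match {
    case Some(d) => tables.get((d, name))
    case None    => tempViews.get(name).orElse(tables.get((currentDb, name)))
  }
}

val cat = new ToyCatalog("default")
cat.createTempView("t")
assert(!cat.tableExists(None, "t"))                                 // a temp view alone does not "exist"
assert(cat.tempViewOrTableMetadata(None, "t").exists(_.isTempView)) // but it still resolves when unqualified
```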
spark git commit: [SPARK-17425][SQL] Override sameResult in HiveTableScanExec to make ReuseExchange work in text format table
Repository: spark Updated Branches: refs/heads/master b50b34f56 -> cb324f611 [SPARK-17425][SQL] Override sameResult in HiveTableScanExec to make ReuseExchange work in text format table ## What changes were proposed in this pull request? The PR will override the `sameResult` in `HiveTableScanExec` to make `ReuseExchange` work in text format table. ## How was this patch tested? # SQL ```sql SELECT * FROM src t1 JOIN src t2 ON t1.key = t2.key JOIN src t3 ON t1.key = t3.key; ``` # Before ``` == Physical Plan == *BroadcastHashJoin [key#30], [key#34], Inner, BuildRight :- *BroadcastHashJoin [key#30], [key#32], Inner, BuildRight : :- *Filter isnotnull(key#30) : : +- HiveTableScan [key#30, value#31], MetastoreRelation default, src : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- *Filter isnotnull(key#32) :+- HiveTableScan [key#32, value#33], MetastoreRelation default, src +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) +- *Filter isnotnull(key#34) +- HiveTableScan [key#34, value#35], MetastoreRelation default, src ``` # After ``` == Physical Plan == *BroadcastHashJoin [key#2], [key#6], Inner, BuildRight :- *BroadcastHashJoin [key#2], [key#4], Inner, BuildRight : :- *Filter isnotnull(key#2) : : +- HiveTableScan [key#2, value#3], MetastoreRelation default, src : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- *Filter isnotnull(key#4) :+- HiveTableScan [key#4, value#5], MetastoreRelation default, src +- ReusedExchange [key#6, value#7], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) ``` cc: davies cloud-fan Author: Yadong Qi Closes #14988 from watermen/SPARK-17425. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cb324f61 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cb324f61 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cb324f61 Branch: refs/heads/master Commit: cb324f61150c962aeabf0a779f6a09797b3d5072 Parents: b50b34f Author: Yadong Qi Authored: Thu Sep 22 13:04:42 2016 +0800 Committer: Wenchen Fan Committed: Thu Sep 22 13:04:42 2016 +0800 -- .../spark/sql/hive/execution/HiveTableScanExec.scala | 15 +++ 1 file changed, 15 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cb324f61/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala index a716a3e..231f204 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala @@ -164,4 +164,19 @@ case class HiveTableScanExec( } override def output: Seq[Attribute] = attributes + + override def sameResult(plan: SparkPlan): Boolean = plan match { +case other: HiveTableScanExec => + val thisPredicates = partitionPruningPred.map(cleanExpression) + val otherPredicates = other.partitionPruningPred.map(cleanExpression) + + val result = relation.sameResult(other.relation) && +output.length == other.output.length && + output.zip(other.output) +.forall(p => p._1.name == p._2.name && p._1.dataType == p._2.dataType) && + thisPredicates.length == otherPredicates.length && +thisPredicates.zip(otherPredicates).forall(p => p._1.semanticEquals(p._2)) + 
result +case _ => false + } }
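A quick way to confirm the effect of the override (a sketch; it assumes a Hive-enabled SparkSession bound to `spark` and the classic `src(key INT, value STRING)` test table):

```scala
import org.apache.spark.sql.execution.exchange.ReusedExchangeExec

val df = spark.sql(
  """SELECT * FROM src t1
    |JOIN src t2 ON t1.key = t2.key
    |JOIN src t3 ON t1.key = t3.key""".stripMargin)

// Collect the reused exchanges in the executed plan; before this change the list came back
// empty for text-format Hive tables because the scans were not recognized as equivalent.
val reused = df.queryExecution.executedPlan.collect { case r: ReusedExchangeExec => r }
assert(reused.nonEmpty)
```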
spark git commit: [SPARK-17492][SQL] Fix Reading Cataloged Data Sources without Extending SchemaRelationProvider
Repository: spark Updated Branches: refs/heads/master cb324f611 -> 3a80f92f8 [SPARK-17492][SQL] Fix Reading Cataloged Data Sources without Extending SchemaRelationProvider ### What changes were proposed in this pull request? For data sources without extending `SchemaRelationProvider`, we expect users to not specify schemas when they creating tables. If the schema is input from users, an exception is issued. Since Spark 2.1, for any data source, to avoid infer the schema every time, we store the schema in the metastore catalog. Thus, when reading a cataloged data source table, the schema could be read from metastore catalog. In this case, we also got an exception. For example, ```Scala sql( s""" |CREATE TABLE relationProvierWithSchema |USING org.apache.spark.sql.sources.SimpleScanSource |OPTIONS ( | From '1', | To '10' |) """.stripMargin) spark.table(tableName).show() ``` ``` org.apache.spark.sql.sources.SimpleScanSource does not allow user-specified schemas.; ``` This PR is to fix the above issue. When building a data source, we introduce a flag `isSchemaFromUsers` to indicate whether the schema is really input from users. If true, we issue an exception. Otherwise, we will call the `createRelation` of `RelationProvider` to generate the `BaseRelation`, in which it contains the actual schema. ### How was this patch tested? Added a few cases. Author: gatorsmile Closes #15046 from gatorsmile/tempViewCases. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3a80f92f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3a80f92f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3a80f92f Branch: refs/heads/master Commit: 3a80f92f8f4b91d0a85724bca7d81c6f5bbb78fd Parents: cb324f6 Author: gatorsmile Authored: Thu Sep 22 13:19:06 2016 +0800 Committer: Wenchen Fan Committed: Thu Sep 22 13:19:06 2016 +0800 -- .../sql/execution/datasources/DataSource.scala | 9 ++- .../apache/spark/sql/sources/InsertSuite.scala | 20 ++ .../spark/sql/sources/TableScanSuite.scala | 64 +--- .../sql/test/DataFrameReaderWriterSuite.scala | 33 ++ 4 files changed, 102 insertions(+), 24 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3a80f92f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala index 413976a..3206701 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala @@ -333,8 +333,13 @@ case class DataSource( dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) case (_: SchemaRelationProvider, None) => throw new AnalysisException(s"A schema needs to be specified when using $className.") - case (_: RelationProvider, Some(_)) => -throw new AnalysisException(s"$className does not allow user-specified schemas.") + case (dataSource: RelationProvider, Some(schema)) => +val baseRelation = + dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) +if (baseRelation.schema != schema) { + throw new AnalysisException(s"$className does not allow user-specified schemas.") +} +baseRelation // We are reading from the results of a streaming query. Load files from the metadata log // instead of listing them using HDFS APIs. 
http://git-wip-us.apache.org/repos/asf/spark/blob/3a80f92f/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala index 6454d71..5eb5464 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala @@ -65,6 +65,26 @@ class InsertSuite extends DataSourceTest with SharedSQLContext { ) } + test("insert into a temp view that does not point to an insertable data source") { +import testImplicits._ +withTempView("t1", "t2") { + sql( +""" + |CREATE TEMPORARY VIEW t1 + |USING org.apache.spark.sql.sources.SimpleScanSource + |OPTIONS ( + | From '1', + | To '10') +""".stripMargin) + sparkContext.parallelize(1 to 10).
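For context, a minimal data source in the spirit of the `SimpleScanSource` used by these tests: it implements only `RelationProvider`, so its schema comes from the relation itself, never from the user. The class name `RangeSource` and the column name are invented for this sketch.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class RangeSource extends RelationProvider {
  override def createRelation(
      ctx: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val lower = parameters("from").toInt
    val upper = parameters("to").toInt
    new BaseRelation with TableScan {
      override val sqlContext: SQLContext = ctx
      // The relation fixes its own schema; that is what ends up stored in the metastore.
      override val schema: StructType = StructType(StructField("i", IntegerType) :: Nil)
      override def buildScan(): RDD[Row] =
        ctx.sparkContext.parallelize(lower to upper).map(Row(_))
    }
  }
}
```

With the fix, a table created with `USING` the fully qualified class name and `OPTIONS (from '1', to '10')` can later be read back via `spark.table(...)`: the schema retrieved from the catalog equals the relation's own schema, so the "does not allow user-specified schemas" error is no longer raised.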
spark git commit: [SPARK-17625][SQL] set expectedOutputAttributes when converting SimpleCatalogRelation to LogicalRelation
Repository: spark Updated Branches: refs/heads/master 3a80f92f8 -> de7df7def [SPARK-17625][SQL] set expectedOutputAttributes when converting SimpleCatalogRelation to LogicalRelation ## What changes were proposed in this pull request? We should set expectedOutputAttributes when converting SimpleCatalogRelation to LogicalRelation, otherwise the outputs of LogicalRelation are different from outputs of SimpleCatalogRelation - they have different exprId's. ## How was this patch tested? add a test case Author: Zhenhua Wang Closes #15182 from wzhfy/expectedAttributes. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/de7df7de Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/de7df7de Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/de7df7de Branch: refs/heads/master Commit: de7df7defc99e04fefd990974151a701f64b75b4 Parents: 3a80f92 Author: Zhenhua Wang Authored: Thu Sep 22 14:48:49 2016 +0800 Committer: Wenchen Fan Committed: Thu Sep 22 14:48:49 2016 +0800 -- .../execution/datasources/DataSourceStrategy.scala| 10 +++--- .../scala/org/apache/spark/sql/DataFrameSuite.scala | 14 +- 2 files changed, 20 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/de7df7de/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index c8ad5b3..63f01c5 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala @@ -197,7 +197,10 @@ case class DataSourceAnalysis(conf: CatalystConf) extends Rule[LogicalPlan] { * source information. 
*/ class FindDataSourceTable(sparkSession: SparkSession) extends Rule[LogicalPlan] { - private def readDataSourceTable(sparkSession: SparkSession, table: CatalogTable): LogicalPlan = { + private def readDataSourceTable( + sparkSession: SparkSession, + simpleCatalogRelation: SimpleCatalogRelation): LogicalPlan = { +val table = simpleCatalogRelation.catalogTable val dataSource = DataSource( sparkSession, @@ -209,16 +212,17 @@ class FindDataSourceTable(sparkSession: SparkSession) extends Rule[LogicalPlan] LogicalRelation( dataSource.resolveRelation(), + expectedOutputAttributes = Some(simpleCatalogRelation.output), catalogTable = Some(table)) } override def apply(plan: LogicalPlan): LogicalPlan = plan transform { case i @ logical.InsertIntoTable(s: SimpleCatalogRelation, _, _, _, _) if DDLUtils.isDatasourceTable(s.metadata) => - i.copy(table = readDataSourceTable(sparkSession, s.metadata)) + i.copy(table = readDataSourceTable(sparkSession, s)) case s: SimpleCatalogRelation if DDLUtils.isDatasourceTable(s.metadata) => - readDataSourceTable(sparkSession, s.metadata) + readDataSourceTable(sparkSession, s) } } http://git-wip-us.apache.org/repos/asf/spark/blob/de7df7de/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index c2d256b..2c60a7d 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -26,7 +26,8 @@ import scala.util.Random import org.scalatest.Matchers._ import org.apache.spark.SparkException -import org.apache.spark.sql.catalyst.plans.logical.{OneRowRelation, Union} +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.plans.logical.{OneRowRelation, Project, Union} import org.apache.spark.sql.execution.QueryExecution import org.apache.spark.sql.execution.aggregate.HashAggregateExec import org.apache.spark.sql.execution.exchange.{BroadcastExchangeExec, ReusedExchangeExec, ShuffleExchange} @@ -1585,4 +1586,15 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { val d = sampleDf.withColumn("c", monotonically_increasing_id).select($"c").collect assert(d.size == d.distinct.size) } + + test("SPARK-17625: data source table in InMemoryCatalog should guarantee output consistency") { +val tableName = "tbl" +withTable(tableName) { + spark.range(10).select('id as 'i, 'id as 'j).write.saveAsTable(tableName) + val relation = spark.s
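To make the exprId point concrete, a toy illustration in plain Scala (not Catalyst types; `Attr` is invented) of why the converted `LogicalRelation` must reuse the `SimpleCatalogRelation` output attributes:

```scala
// An attribute reference is matched by its exprId, not by its name.
final case class Attr(name: String, exprId: Long)

// Suppose a Project was resolved against the original relation's output: i#1, j#2.
val originalOutput = Seq(Attr("i", 1L), Attr("j", 2L))
val projectRefs    = originalOutput

// If the rule swapped in a relation that minted fresh ids (i#10, j#11), the references dangle:
val freshOutput = Seq(Attr("i", 10L), Attr("j", 11L))
projectRefs.forall(ref => freshOutput.exists(_.exprId == ref.exprId))     // false -> plan no longer resolves

// Passing expectedOutputAttributes = Some(originalOutput) keeps the ids stable:
projectRefs.forall(ref => originalOutput.exists(_.exprId == ref.exprId))  // true
```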