spark-website git commit: Add Israel Spark meetup to community page per request. Use https for meetup while we're here. Pick up a recent change to paper hyperlink reflected only in markdown, not HTML
Repository: spark-website Updated Branches: refs/heads/asf-site eee58685c -> 7c96b646e Add Israel Spark meetup to community page per request. Use https for meetup while we're here. Pick up a recent change to paper hyperlink reflected only in markdown, not HTML Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/7c96b646 Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/7c96b646 Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/7c96b646 Branch: refs/heads/asf-site Commit: 7c96b646eb2de2dbe6aec91a82d86699e13c59c5 Parents: eee5868 Author: Sean Owen Authored: Wed Sep 21 08:32:16 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 08:32:16 2016 +0100 -- community.md| 57 +--- site/community.html | 57 +--- site/research.html | 2 +- 3 files changed, 61 insertions(+), 55 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark-website/blob/7c96b646/community.md -- diff --git a/community.md b/community.md index d856409..b0c5b3a 100644 --- a/community.md +++ b/community.md @@ -56,84 +56,87 @@ navigation: Spark Meetups are grass-roots events organized and hosted by leaders and champions in the community around the world. Check out http://spark.meetup.com";>http://spark.meetup.com to find a Spark meetup in your part of the world. Below is a partial list of Spark meetups. -http://www.meetup.com/spark-users/";>Bay Area Spark Meetup. +https://www.meetup.com/spark-users/";>Bay Area Spark Meetup. This group has been running since January 2012 in the San Francisco area. -The meetup page also contains an http://www.meetup.com/spark-users/events/past/";>archive of past meetups, including videos and http://www.meetup.com/spark-users/files/";>slides for most of the recent talks. +The meetup page also contains an https://www.meetup.com/spark-users/events/past/";>archive of past meetups, including videos and https://www.meetup.com/spark-users/files/";>slides for most of the recent talks. 
-http://www.meetup.com/Spark-Barcelona/";>Barcelona Spark Meetup +https://www.meetup.com/Spark-Barcelona/";>Barcelona Spark Meetup -http://www.meetup.com/Spark_big_data_analytics/";>Bangalore Spark Meetup +https://www.meetup.com/Spark_big_data_analytics/";>Bangalore Spark Meetup -http://www.meetup.com/Berlin-Apache-Spark-Meetup/";>Berlin Spark Meetup +https://www.meetup.com/Berlin-Apache-Spark-Meetup/";>Berlin Spark Meetup -http://www.meetup.com/spark-user-beijing-Meetup/";>Beijing Spark Meetup +https://www.meetup.com/spark-user-beijing-Meetup/";>Beijing Spark Meetup -http://www.meetup.com/Boston-Apache-Spark-User-Group/";>Boston Spark Meetup +https://www.meetup.com/Boston-Apache-Spark-User-Group/";>Boston Spark Meetup -http://www.meetup.com/Boulder-Denver-Spark-Meetup/";>Boulder/Denver Spark Meetup +https://www.meetup.com/Boulder-Denver-Spark-Meetup/";>Boulder/Denver Spark Meetup -http://www.meetup.com/Chicago-Spark-Users/";>Chicago Spark Users +https://www.meetup.com/Chicago-Spark-Users/";>Chicago Spark Users -http://www.meetup.com/Christchurch-Apache-Spark-Meetup/";>Christchurch Apache Spark Meetup +https://www.meetup.com/Christchurch-Apache-Spark-Meetup/";>Christchurch Apache Spark Meetup -http://www.meetup.com/Cincinnati-Apache-Spark-Meetup/";>Cincinanati Apache Spark Meetup +https://www.meetup.com/Cincinnati-Apache-Spark-Meetup/";>Cincinanati Apache Spark Meetup -http://www.meetup.com/Hangzhou-Apache-Spark-Meetup/";>Hangzhou Spark Meetup +https://www.meetup.com/Hangzhou-Apache-Spark-Meetup/";>Hangzhou Spark Meetup -http://www.meetup.com/Spark-User-Group-Hyderabad/";>Hyderabad Spark Meetup +https://www.meetup.com/Spark-User-Group-Hyderabad/";>Hyderabad Spark Meetup -http://www.meetup.com/Apache-Spark-Ljubljana-Meetup/";>Ljubljana Spark Meetup +https://www.meetup.com/israel-spark-users/";>Israel Spark Users -http://www.meetup.com/Spark-London/";>London Spark Meetup +https://www.meetup.com/Apache-Spark-Ljubljana-Meetup/";>Ljubljana Spark Meetup -http://www.meetup.com/Apache-Spark-Maryland/";>Maryland Spark Meetup +https://www.meetup.com/Spark-London/";>London Spark Meetup -http://www.meetup.com/Mumbai-Spark-Meetup/";>Mumbai Spark Meetup +https://www.meetup.com/Apache-Spark-Maryland/";>Maryland Spark Meetup -http://www.meetup.com/Apache-Spark-in-Moscow/";>Moscow Spark Meetup +https://www.meetup.com/Mumbai-
spark git commit: [CORE][DOC] Fix errors in comments
Repository: spark Updated Branches: refs/heads/master e48ebc4e4 -> 61876a427 [CORE][DOC] Fix errors in comments ## What changes were proposed in this pull request? While reading source code of CORE and SQL core, I found some minor errors in comments such as extra space, missing blank line and grammar error. I fixed these minor errors and might find more during my source code study. ## How was this patch tested? Manually build Author: wm...@hotmail.com Closes #15151 from wangmiao1981/mem. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/61876a42 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/61876a42 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/61876a42 Branch: refs/heads/master Commit: 61876a42793bde0da90f54b44255148ed54b7f61 Parents: e48ebc4 Author: wm...@hotmail.com Authored: Wed Sep 21 09:33:29 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 09:33:29 2016 +0100 -- core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala | 2 +- sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/61876a42/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala -- diff --git a/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala b/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala index cae7c9e..f255f5b 100644 --- a/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala +++ b/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala @@ -28,7 +28,7 @@ import org.apache.spark.util.Utils * :: DeveloperApi :: * This class represent an unique identifier for a BlockManager. * - * The first 2 constructors of this class is made private to ensure that BlockManagerId objects + * The first 2 constructors of this class are made private to ensure that BlockManagerId objects * can be created only using the apply method in the companion object. This allows de-duplication * of ID objects. Also, constructor parameters are private to ensure that parameters cannot be * modified from outside this class. http://git-wip-us.apache.org/repos/asf/spark/blob/61876a42/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala index 0f6292d..6d7ac0f 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala @@ -937,7 +937,7 @@ object SparkSession { } /** - * Return true if Hive classes can be loaded, otherwise false. + * @return true if Hive classes can be loaded, otherwise false. */ private[spark] def hiveClassesArePresent: Boolean = { try { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
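For context, `hiveClassesArePresent` (whose scaladoc is corrected above) is a classpath probe. A minimal standalone sketch of that kind of check, using plain `Class.forName` rather than Spark's internal `Utils.classForName`, and not the Spark code itself:

```
// Sketch: report whether a class can be loaded from the current classpath,
// treating both a missing class and a missing transitive dependency as "absent".
def classIsPresent(className: String): Boolean =
  try {
    Class.forName(className)
    true
  } catch {
    case _: ClassNotFoundException | _: NoClassDefFoundError => false
  }

// e.g. classIsPresent("org.apache.hadoop.hive.conf.HiveConf")
```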
spark git commit: [SPARK-17585][PYSPARK][CORE] PySpark SparkContext.addFile supports adding files recursively
Repository: spark Updated Branches: refs/heads/master 61876a427 -> d3b886976 [SPARK-17585][PYSPARK][CORE] PySpark SparkContext.addFile supports adding files recursively ## What changes were proposed in this pull request? Users would like to add a directory as dependency in some cases, they can use ```SparkContext.addFile``` with argument ```recursive=true``` to recursively add all files under the directory by using Scala. But Python users can only add file not directory, we should also make it supported. ## How was this patch tested? Unit test. Author: Yanbo Liang Closes #15140 from yanboliang/spark-17585. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d3b88697 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d3b88697 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d3b88697 Branch: refs/heads/master Commit: d3b88697638dcf32854fe21a6c53dfb3782773b9 Parents: 61876a4 Author: Yanbo Liang Authored: Wed Sep 21 01:37:03 2016 -0700 Committer: Yanbo Liang Committed: Wed Sep 21 01:37:03 2016 -0700 -- .../spark/api/java/JavaSparkContext.scala | 13 + python/pyspark/context.py | 7 +-- python/pyspark/tests.py | 20 +++- python/test_support/hello.txt | 1 - python/test_support/hello/hello.txt | 1 + .../test_support/hello/sub_hello/sub_hello.txt | 1 + 6 files changed, 35 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d3b88697/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala index 131f36f..4e50c26 100644 --- a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala +++ b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala @@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext) } /** + * Add a file to be downloaded with this Spark job on every node. + * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported + * filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, + * use `SparkFiles.get(fileName)` to find its download location. + * + * A directory can be given if the recursive option is set to true. Currently directories are only + * supported for Hadoop-supported filesystems. + */ + def addFile(path: String, recursive: Boolean): Unit = { +sc.addFile(path, recursive) + } + + /** * Adds a JAR dependency for all tasks to be executed on this SparkContext in the future. * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported * filesystems), or an HTTP, HTTPS or FTP URI. http://git-wip-us.apache.org/repos/asf/spark/blob/d3b88697/python/pyspark/context.py -- diff --git a/python/pyspark/context.py b/python/pyspark/context.py index 5c32f8e..7a7f59c 100644 --- a/python/pyspark/context.py +++ b/python/pyspark/context.py @@ -767,7 +767,7 @@ class SparkContext(object): SparkContext._next_accum_id += 1 return Accumulator(SparkContext._next_accum_id - 1, value, accum_param) -def addFile(self, path): +def addFile(self, path, recursive=False): """ Add a file to be downloaded with this Spark job on every node. The C{path} passed can be either a local file, a file in HDFS @@ -778,6 +778,9 @@ class SparkContext(object): L{SparkFiles.get(fileName)} with the filename to find its download location. +A directory can be given if the recursive option is set to True. 
+Currently directories are only supported for Hadoop-supported filesystems. + >>> from pyspark import SparkFiles >>> path = os.path.join(tempdir, "test.txt") >>> with open(path, "w") as testFile: @@ -790,7 +793,7 @@ class SparkContext(object): >>> sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect() [100, 200, 300, 400] """ -self._jsc.sc().addFile(path) +self._jsc.sc().addFile(path, recursive) def addPyFile(self, path): """ http://git-wip-us.apache.org/repos/asf/spark/blob/d3b88697/python/pyspark/tests.py -- diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py index 0a029b6..b075691 100644 --- a/python/pyspark/tests.py +++ b/python/pyspark/tests.py @@ -409,13 +409,23 @@ class AddFileTests(PySparkTestCase): self.assertEqual("Hello World!"
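For reference, the Scala-side call that this change wires the Python and Java APIs up to takes the same `recursive` flag. A usage sketch, assuming an existing `SparkContext` named `sc` and a hypothetical HDFS directory:

```
import org.apache.spark.SparkFiles

// Hypothetical directory; directories require recursive = true and are currently
// only supported for Hadoop-supported filesystems.
sc.addFile("hdfs:///apps/myjob/config-dir", true)

// On executors, SparkFiles resolves the local copy of the shipped directory;
// files inside it keep their relative layout.
val confDir = SparkFiles.get("config-dir")
```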
spark git commit: [SPARK-17595][MLLIB] Use a bounded priority queue to find synonyms in Word2VecModel
Repository: spark Updated Branches: refs/heads/master d3b886976 -> 7654385f2 [SPARK-17595][MLLIB] Use a bounded priority queue to find synonyms in Word2VecModel ## What changes were proposed in this pull request? The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements with the highest similarity to the query vector currently sorts the collection of similarities for every vocabulary element. This involves making multiple copies of the collection of similarities while doing a (relatively) expensive sort. It would be more efficient to find the best matches by maintaining a bounded priority queue and populating it with a single pass over the vocabulary, and that is exactly what this patch does. ## How was this patch tested? This patch adds no user-visible functionality and its correctness should be exercised by existing tests. To ensure that this approach is actually faster, I made a microbenchmark for `findSynonyms`: ``` object W2VTiming { import org.apache.spark.{SparkContext, SparkConf} import org.apache.spark.mllib.feature.Word2VecModel def run(modelPath: String, scOpt: Option[SparkContext] = None) { val sc = scOpt.getOrElse(new SparkContext(new SparkConf(true).setMaster("local[*]").setAppName("test"))) val model = Word2VecModel.load(sc, modelPath) val keys = model.getVectors.keys val start = System.currentTimeMillis for(key <- keys) { model.findSynonyms(key, 5) model.findSynonyms(key, 10) model.findSynonyms(key, 25) model.findSynonyms(key, 50) } val finish = System.currentTimeMillis println("run completed in " + (finish - start) + "ms") } } ``` I ran this test on a model generated from the complete works of Jane Austen and found that the new approach was over 3x faster than the old approach. (If the `num` argument to `findSynonyms` is very close to the vocabulary size, the new approach will have less of an advantage over the old one.) Author: William Benton Closes #15150 from willb/SPARK-17595. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7654385f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7654385f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7654385f Branch: refs/heads/master Commit: 7654385f268a3f481c4574ce47a19ab21155efd5 Parents: d3b8869 Author: William Benton Authored: Wed Sep 21 09:45:06 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 09:45:06 2016 +0100 -- .../org/apache/spark/mllib/feature/Word2Vec.scala | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7654385f/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala index 42ca966..2364d43 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala @@ -35,6 +35,7 @@ import org.apache.spark.mllib.linalg.{Vector, Vectors} import org.apache.spark.mllib.util.{Loader, Saveable} import org.apache.spark.rdd._ import org.apache.spark.sql.SparkSession +import org.apache.spark.util.BoundedPriorityQueue import org.apache.spark.util.Utils import org.apache.spark.util.random.XORShiftRandom @@ -555,7 +556,7 @@ class Word2VecModel private[spark] ( num: Int, wordOpt: Option[String]): Array[(String, Double)] = { require(num > 0, "Number of similar words should > 0") -// TODO: optimize top-k + val fVector = vector.toArray.map(_.toFloat) val cosineVec = Array.fill[Float](numWords)(0) val alpha: Float = 1 @@ -580,10 +581,16 @@ class Word2VecModel private[spark] ( ind += 1 } -val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2) +val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2)) + +for(i <- cosVec.indices) { + pq += Tuple2(wordList(i), cosVec(i)) +} + +val scored = pq.toSeq.sortBy(-_._2) val filtered = wordOpt match { - case Some(w) => scored.take(num + 1).filter(tup => w != tup._1) + case Some(w) => scored.filter(tup => w != tup._1) case None => scored } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
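`BoundedPriorityQueue` is a Spark-internal (`private[spark]`) helper, but the single-pass top-k idea it implements can be sketched with a plain Scala priority queue that evicts its lowest-scoring element once it holds more than `k` entries (a sketch, not the patch's code):

```
import scala.collection.mutable

// Sketch: keep only the k highest-scoring (word, score) pairs in a single pass,
// instead of sorting the whole vocabulary.
def topK(scored: Iterator[(String, Double)], k: Int): Seq[(String, Double)] = {
  // The reversed ordering makes this a min-heap on the score, so dequeue()
  // evicts the lowest-scoring retained pair.
  val queue = mutable.PriorityQueue.empty[(String, Double)](
    Ordering.by[(String, Double), Double](_._2).reverse)
  scored.foreach { pair =>
    queue.enqueue(pair)
    if (queue.size > k) queue.dequeue()
  }
  queue.toSeq.sortBy(-_._2)
}
```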
spark git commit: [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value
Repository: spark Updated Branches: refs/heads/master 7654385f2 -> 3977223a3 [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value ## What changes were proposed in this pull request? Remainder(%) expression's `eval()` returns incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to do "%", and that lose precision. This bug only affects the `eval()` that is used by constant folding, the codegen path is not impacted. ### Before change ``` scala> -5083676433652386516D % 10 res2: Double = -6.0 scala> spark.sql("select -5083676433652386516D % 10 as a").show +---+ | a| +---+ |0.0| +---+ ``` ### After change ``` scala> spark.sql("select -5083676433652386516D % 10 as a").show ++ | a| ++ |-6.0| ++ ``` ## How was this patch tested? Unit test. Author: Sean Zhong Closes #15171 from clockfly/SPARK-17617. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3977223a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3977223a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3977223a Branch: refs/heads/master Commit: 3977223a3268aaf6913a325ee459139a4a302b1c Parents: 7654385 Author: Sean Zhong Authored: Wed Sep 21 16:53:34 2016 +0800 Committer: Wenchen Fan Committed: Wed Sep 21 16:53:34 2016 +0800 -- .../spark/sql/catalyst/expressions/arithmetic.scala | 6 +- .../catalyst/expressions/ArithmeticExpressionSuite.scala | 11 +++ 2 files changed, 16 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3977223a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala index 13e539a..6f3db79 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala @@ -310,7 +310,11 @@ case class Remainder(left: Expression, right: Expression) if (input1 == null) { null } else { -integral.rem(input1, input2) +input1 match { + case d: Double => d % input2.asInstanceOf[java.lang.Double] + case f: Float => f % input2.asInstanceOf[java.lang.Float] + case _ => integral.rem(input1, input2) +} } } } http://git-wip-us.apache.org/repos/asf/spark/blob/3977223a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala index 5c98242..0d86efd 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala @@ -175,6 +175,17 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper } } + test("SPARK-17617: % (Remainder) double % double on super big double") { +val leftDouble = Literal(-5083676433652386516D) +val rightDouble = Literal(10D) +checkEvaluation(Remainder(leftDouble, rightDouble), -6.0D) + +// Float has smaller precision +val leftFloat = Literal(-5083676433652386516F) +val rightFloat = Literal(10F) +checkEvaluation(Remainder(leftFloat, 
rightFloat), -2.0F) + } + test("Abs") { testNumericDataTypes { convert => val input = Literal(convert(1)) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
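The precision loss is reproducible without Spark: converting such a large `Double` to a decimal goes through its string form, which rounds away the low digits, so the remainder collapses to zero. A standalone sketch (the inline results match the before/after output shown above):

```
val dividend = -5083676433652386516D   // not exactly representable as a Double

// Direct floating-point remainder, which is what codegen (and eval() after this fix) computes:
println(dividend % 10)                                             // -6.0

// Routing the value through a decimal representation first (roughly what the old
// eval() path did via BigDecimal) rounds it and loses the remainder:
val asDecimal = java.math.BigDecimal.valueOf(dividend)
println(asDecimal.remainder(java.math.BigDecimal.TEN).doubleValue) // 0.0
```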
spark git commit: [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value
Repository: spark Updated Branches: refs/heads/branch-2.0 726f05716 -> 65295bad9 [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value ## What changes were proposed in this pull request? Remainder(%) expression's `eval()` returns incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to do "%", and that lose precision. This bug only affects the `eval()` that is used by constant folding, the codegen path is not impacted. ### Before change ``` scala> -5083676433652386516D % 10 res2: Double = -6.0 scala> spark.sql("select -5083676433652386516D % 10 as a").show +---+ | a| +---+ |0.0| +---+ ``` ### After change ``` scala> spark.sql("select -5083676433652386516D % 10 as a").show ++ | a| ++ |-6.0| ++ ``` ## How was this patch tested? Unit test. Author: Sean Zhong Closes #15171 from clockfly/SPARK-17617. (cherry picked from commit 3977223a3268aaf6913a325ee459139a4a302b1c) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/65295bad Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/65295bad Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/65295bad Branch: refs/heads/branch-2.0 Commit: 65295bad9b2a81f2394a52eedeb31da0ed2c4847 Parents: 726f057 Author: Sean Zhong Authored: Wed Sep 21 16:53:34 2016 +0800 Committer: Wenchen Fan Committed: Wed Sep 21 16:53:55 2016 +0800 -- .../spark/sql/catalyst/expressions/arithmetic.scala | 6 +- .../catalyst/expressions/ArithmeticExpressionSuite.scala | 11 +++ 2 files changed, 16 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/65295bad/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala index fa459aa..01c5d82 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala @@ -309,7 +309,11 @@ case class Remainder(left: Expression, right: Expression) if (input1 == null) { null } else { -integral.rem(input1, input2) +input1 match { + case d: Double => d % input2.asInstanceOf[java.lang.Double] + case f: Float => f % input2.asInstanceOf[java.lang.Float] + case _ => integral.rem(input1, input2) +} } } } http://git-wip-us.apache.org/repos/asf/spark/blob/65295bad/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala index 2e37887..069c3b3 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala @@ -173,6 +173,17 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper // } } + test("SPARK-17617: % (Remainder) double % double on super big double") { +val leftDouble = Literal(-5083676433652386516D) +val rightDouble = Literal(10D) +checkEvaluation(Remainder(leftDouble, rightDouble), -6.0D) + +// Float has smaller precision +val leftFloat = 
Literal(-5083676433652386516F) +val rightFloat = Literal(10F) +checkEvaluation(Remainder(leftFloat, rightFloat), -2.0F) + } + test("Abs") { testNumericDataTypes { convert => val input = Literal(convert(1)) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value
Repository: spark Updated Branches: refs/heads/branch-1.6 8646b84fb -> 8f88412c3 [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value ## What changes were proposed in this pull request? Remainder(%) expression's `eval()` returns incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to do "%", and that lose precision. This bug only affects the `eval()` that is used by constant folding, the codegen path is not impacted. ### Before change ``` scala> -5083676433652386516D % 10 res2: Double = -6.0 scala> spark.sql("select -5083676433652386516D % 10 as a").show +---+ | a| +---+ |0.0| +---+ ``` ### After change ``` scala> spark.sql("select -5083676433652386516D % 10 as a").show ++ | a| ++ |-6.0| ++ ``` ## How was this patch tested? Unit test. Author: Sean Zhong Closes #15171 from clockfly/SPARK-17617. (cherry picked from commit 3977223a3268aaf6913a325ee459139a4a302b1c) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8f88412c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8f88412c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8f88412c Branch: refs/heads/branch-1.6 Commit: 8f88412c31dc840c15df9822638645381c82a2fe Parents: 8646b84 Author: Sean Zhong Authored: Wed Sep 21 16:53:34 2016 +0800 Committer: Wenchen Fan Committed: Wed Sep 21 16:57:44 2016 +0800 -- .../spark/sql/catalyst/expressions/arithmetic.scala | 6 +- .../catalyst/expressions/ArithmeticExpressionSuite.scala | 11 +++ 2 files changed, 16 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8f88412c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala index cfae285..7905825 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala @@ -273,7 +273,11 @@ case class Remainder(left: Expression, right: Expression) extends BinaryArithmet if (input1 == null) { null } else { -integral.rem(input1, input2) +input1 match { + case d: Double => d % input2.asInstanceOf[java.lang.Double] + case f: Float => f % input2.asInstanceOf[java.lang.Float] + case _ => integral.rem(input1, input2) +} } } } http://git-wip-us.apache.org/repos/asf/spark/blob/8f88412c/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala index 72285c6..a5930d4 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala @@ -172,6 +172,17 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper // } } + test("SPARK-17617: % (Remainder) double % double on super big double") { +val leftDouble = Literal(-5083676433652386516D) +val rightDouble = Literal(10D) +checkEvaluation(Remainder(leftDouble, rightDouble), -6.0D) + +// Float has smaller 
precision +val leftFloat = Literal(-5083676433652386516F) +val rightFloat = Literal(10F) +checkEvaluation(Remainder(leftFloat, rightFloat), -2.0F) + } + test("Abs") { testNumericDataTypes { convert => val input = Literal(convert(1)) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-17599] Prevent ListingFileCatalog from failing if path doesn't exist
Repository: spark Updated Branches: refs/heads/master 3977223a3 -> 28fafa3ee [SPARK-17599] Prevent ListingFileCatalog from failing if path doesn't exist ## What changes were proposed in this pull request? The `ListingFileCatalog` lists files given a set of resolved paths. If a folder is deleted at any time between the paths were resolved and the file catalog can check for the folder, the Spark job fails. This may abruptly stop long running StructuredStreaming jobs for example. Folders may be deleted by users or automatically by retention policies. These cases should not prevent jobs from successfully completing. ## How was this patch tested? Unit test in `FileCatalogSuite` Author: Burak Yavuz Closes #15153 from brkyvz/SPARK-17599. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/28fafa3e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/28fafa3e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/28fafa3e Branch: refs/heads/master Commit: 28fafa3ee8f3478fa441e7bd6c8fd4ab482ca98e Parents: 3977223 Author: Burak Yavuz Authored: Wed Sep 21 17:07:16 2016 +0800 Committer: Wenchen Fan Committed: Wed Sep 21 17:07:16 2016 +0800 -- .../sql/execution/datasources/ListingFileCatalog.scala | 12 ++-- .../sql/execution/datasources/FileCatalogSuite.scala| 11 +++ 2 files changed, 21 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/28fafa3e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala index 60742bd..3253208 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala @@ -17,6 +17,8 @@ package org.apache.spark.sql.execution.datasources +import java.io.FileNotFoundException + import scala.collection.mutable import org.apache.hadoop.fs.{FileStatus, LocatedFileStatus, Path} @@ -97,8 +99,14 @@ class ListingFileCatalog( logTrace(s"Listing $path on driver") val childStatuses = { - val stats = fs.listStatus(path) - if (pathFilter != null) stats.filter(f => pathFilter.accept(f.getPath)) else stats + try { +val stats = fs.listStatus(path) +if (pathFilter != null) stats.filter(f => pathFilter.accept(f.getPath)) else stats + } catch { +case _: FileNotFoundException => + logWarning(s"The directory $path was not found. 
Was it deleted very recently?") + Array.empty[FileStatus] + } } childStatuses.map { http://git-wip-us.apache.org/repos/asf/spark/blob/28fafa3e/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala index 0d9ea51..5c8d322 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala @@ -67,4 +67,15 @@ class FileCatalogSuite extends SharedSQLContext { } } + + test("ListingFileCatalog: folders that don't exist don't throw exceptions") { +withTempDir { dir => + val deletedFolder = new File(dir, "deleted") + assert(!deletedFolder.exists()) + val catalog1 = new ListingFileCatalog( +spark, Seq(new Path(deletedFolder.getCanonicalPath)), Map.empty, None) + // doesn't throw an exception + assert(catalog1.listLeafFiles(catalog1.paths).isEmpty) +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
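The same defensive-listing pattern can be applied on its own when listing Hadoop paths that may disappear concurrently. A sketch with hypothetical names, not the patch's code:

```
import java.io.FileNotFoundException

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Sketch: list a directory, treating a folder deleted between path resolution and
// listing (for example by a retention policy) as simply empty.
def listOrEmpty(path: Path, hadoopConf: Configuration): Array[FileStatus] = {
  val fs = path.getFileSystem(hadoopConf)
  try {
    fs.listStatus(path)
  } catch {
    case _: FileNotFoundException => Array.empty[FileStatus]
  }
}
```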
spark git commit: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate (FPR) test
Repository: spark Updated Branches: refs/heads/master 28fafa3ee -> b366f1849 [SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate (FPR) test ## What changes were proposed in this pull request? Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular univariate statistical test for feature selection. We add a chiSquare Selector based on False Positive Rate (FPR) test in this PR, like it is implemented in scikit-learn. http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection ## How was this patch tested? Add Scala ut Author: Peng, Meng Closes #14597 from mpjlu/fprChiSquare. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b366f184 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b366f184 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b366f184 Branch: refs/heads/master Commit: b366f18496e1ce8bd20fe58a0245ef7d91819a03 Parents: 28fafa3 Author: Peng, Meng Authored: Wed Sep 21 10:17:38 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 10:17:38 2016 +0100 -- .../apache/spark/ml/feature/ChiSqSelector.scala | 69 - .../spark/mllib/api/python/PythonMLLibAPI.scala | 28 - .../spark/mllib/feature/ChiSqSelector.scala | 103 ++- .../spark/ml/feature/ChiSqSelectorSuite.scala | 11 +- .../mllib/feature/ChiSqSelectorSuite.scala | 18 project/MimaExcludes.scala | 3 + python/pyspark/mllib/feature.py | 71 - 7 files changed, 262 insertions(+), 41 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b366f184/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala index 1482eb3..0c6a37b 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala @@ -27,6 +27,7 @@ import org.apache.spark.ml.param._ import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util._ import org.apache.spark.mllib.feature +import org.apache.spark.mllib.feature.ChiSqSelectorType import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} import org.apache.spark.rdd.RDD @@ -54,11 +55,47 @@ private[feature] trait ChiSqSelectorParams extends Params /** @group getParam */ def getNumTopFeatures: Int = $(numTopFeatures) + + final val percentile = new DoubleParam(this, "percentile", +"Percentile of features that selector will select, ordered by statistics value descending.", +ParamValidators.inRange(0, 1)) + setDefault(percentile -> 0.1) + + /** @group getParam */ + def getPercentile: Double = $(percentile) + + final val alpha = new DoubleParam(this, "alpha", +"The highest p-value for features to be kept.", +ParamValidators.inRange(0, 1)) + setDefault(alpha -> 0.05) + + /** @group getParam */ + def getAlpha: Double = $(alpha) + + /** + * The ChiSqSelector supports KBest, Percentile, FPR selection, + * which is the same as ChiSqSelectorType defined in MLLIB. 
+ * when call setNumTopFeatures, the selectorType is set to KBest + * when call setPercentile, the selectorType is set to Percentile + * when call setAlpha, the selectorType is set to FPR + */ + final val selectorType = new Param[String](this, "selectorType", +"ChiSqSelector Type: KBest, Percentile, FPR") + setDefault(selectorType -> ChiSqSelectorType.KBest.toString) + + /** @group getParam */ + def getChiSqSelectorType: String = $(selectorType) } /** * Chi-Squared feature selection, which selects categorical features to use for predicting a * categorical label. + * The selector supports three selection methods: `KBest`, `Percentile` and `FPR`. + * `KBest` chooses the `k` top features according to a chi-squared test. + * `Percentile` is similar but chooses a fraction of all features instead of a fixed number. + * `FPR` chooses all features whose false positive rate meets some threshold. + * By default, the selection method is `KBest`, the default number of top features is 50. + * User can use setNumTopFeatures, setPercentile and setAlpha to set different selection methods. */ @Since("1.6.0") final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: String) @@ -69,7 +106,22 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str /** @group setParam */ @Sin
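A usage sketch of the selector on the ML side, with a tiny illustrative dataset; it assumes an active `SparkSession` named `spark`, and `setPercentile`/`setAlpha` are the new entry points described in the scaladoc above:

```
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

// Assumes an active SparkSession named `spark`; the data is purely illustrative.
val df = spark.createDataFrame(Seq(
  (Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)).toDF("features", "label")

val selector = new ChiSqSelector()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")
  .setNumTopFeatures(1)        // KBest (the default method): keep the single best feature
// .setPercentile(0.5)         // Percentile: keep the top 50% of features
// .setAlpha(0.05)             // FPR (added here): keep features whose p-value is below 0.05

selector.fit(df).transform(df).show()
```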
spark git commit: [SPARK-17219][ML] Add NaN value handling in Bucketizer
Repository: spark Updated Branches: refs/heads/master b366f1849 -> 57dc326bd [SPARK-17219][ML] Add NaN value handling in Bucketizer ## What changes were proposed in this pull request? This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value. Sometimes, null value might also be useful to users, so in these cases, Bucketizer should reserve one extra bucket for NaN values, instead of throwing an illegal exception. Before: ``` Bucketizer.transform on NaN value threw an illegal exception. ``` After: ``` NaN values will be grouped in an extra bucket. ``` ## How was this patch tested? New test cases added in `BucketizerSuite`. Signed-off-by: VinceShieh Author: VinceShieh Closes #14858 from VinceShieh/spark-17219. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/57dc326b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/57dc326b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/57dc326b Branch: refs/heads/master Commit: 57dc326bd00cf0a49da971e9c573c48ae28acaa2 Parents: b366f18 Author: VinceShieh Authored: Wed Sep 21 10:20:57 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 10:20:57 2016 +0100 -- docs/ml-features.md | 6 +++- .../apache/spark/ml/feature/Bucketizer.scala| 13 +--- .../spark/ml/feature/QuantileDiscretizer.scala | 9 -- .../spark/ml/feature/BucketizerSuite.scala | 31 .../ml/feature/QuantileDiscretizerSuite.scala | 29 +++--- python/pyspark/ml/feature.py| 5 .../spark/sql/DataFrameStatFunctions.scala | 4 ++- 7 files changed, 85 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/57dc326b/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 746593f..a39b31c 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1102,7 +1102,11 @@ for more details on the API. ## QuantileDiscretizer `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned -categorical features. The number of bins is set by the `numBuckets` parameter. +categorical features. The number of bins is set by the `numBuckets` parameter. It is possible +that the number of buckets used will be less than this value, for example, if there are too few +distinct values of the input to create enough distinct quantiles. Note also that NaN values are +handled specially and placed into their own bucket. For example, if 4 buckets are used, then +non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4]. The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions) for a detailed description). The precision of the approximation can be controlled with the http://git-wip-us.apache.org/repos/asf/spark/blob/57dc326b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala index 100d9e7..ec0ea05 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala @@ -106,7 +106,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String @Since("1.6.0") object Bucketizer extends DefaultParamsReadable[Bucketizer] { - /** We require splits to be of length >= 3 and to be in strictly increasing order. 
*/ + /** + * We require splits to be of length >= 3 and to be in strictly increasing order. + * No NaN split should be accepted. + */ private[feature] def checkSplits(splits: Array[Double]): Boolean = { if (splits.length < 3) { false @@ -114,10 +117,10 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] { var i = 0 val n = splits.length - 1 while (i < n) { -if (splits(i) >= splits(i + 1)) return false +if (splits(i) >= splits(i + 1) || splits(i).isNaN) return false i += 1 } - true + !splits(n).isNaN } } @@ -126,7 +129,9 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] { * @throws SparkException if a feature is < splits.head or > splits.last */ private[feature] def binarySearchForBuckets(splits: Array[Double], feature: Double): Double = { -if (feature == splits.last) { +if (feature.isNaN) { + splits.length - 1 +} else if (feature == splits.last) { splits.length - 2
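A small sketch of the new behaviour, assuming an active `SparkSession` named `spark`: non-NaN values are binned as before, while the NaN row is assigned one extra bucket instead of failing the transform.

```
import org.apache.spark.ml.feature.Bucketizer

// Assumes an active SparkSession named `spark`.
val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)
val data = Array(-0.9, -0.2, 0.1, 0.8, Double.NaN)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)

// Four regular buckets (indices 0 to 3); with this patch the NaN row is placed in the
// extra bucket with index 4 instead of causing an exception.
bucketizer.transform(df).show()
```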
spark git commit: [SPARK-17583][SQL] Remove useless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV
Repository: spark Updated Branches: refs/heads/master 57dc326bd -> 25a020be9 [SPARK-17583][SQL] Remove uesless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV ## What changes were proposed in this pull request? This PR includes the changes below: 1. Upgrade Univocity library from 2.1.1 to 2.2.1 This includes some performance improvement and also enabling auto-extending buffer in `maxCharsPerColumn` option in CSV. Please refer the [release notes](https://github.com/uniVocity/univocity-parsers/releases). 2. Remove useless `rowSeparator` variable existing in `CSVOptions` We have this unused variable in [CSVOptions.scala#L127](https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127) but it seems possibly causing confusion that it actually does not care of `\r\n`. For example, we have an issue open about this, [SPARK-17227](https://issues.apache.org/jira/browse/SPARK-17227), describing this variable. This variable is virtually not being used because we rely on `LineRecordReader` in Hadoop which deals with only both `\n` and `\r\n`. 3. Set the default value of `maxCharsPerColumn` to auto-expending. We are setting 100 for the length of each column. It'd be more sensible we allow auto-expending rather than fixed length by default. To make sure, using `-1` is being described in the release note, [2.2.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.2.0). ## How was this patch tested? N/A Author: hyukjinkwon Closes #15138 from HyukjinKwon/SPARK-17583. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/25a020be Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/25a020be Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/25a020be Branch: refs/heads/master Commit: 25a020be99b6a540e4001e59e40d5d1c8aa53812 Parents: 57dc326 Author: hyukjinkwon Authored: Wed Sep 21 10:35:29 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 10:35:29 2016 +0100 -- dev/deps/spark-deps-hadoop-2.2 | 2 +- dev/deps/spark-deps-hadoop-2.3 | 2 +- dev/deps/spark-deps-hadoop-2.4 | 2 +- dev/deps/spark-deps-hadoop-2.6 | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- python/pyspark/sql/readwriter.py | 2 +- python/pyspark/sql/streaming.py | 2 +- sql/core/pom.xml | 2 +- .../src/main/scala/org/apache/spark/sql/DataFrameReader.scala| 4 ++-- .../apache/spark/sql/execution/datasources/csv/CSVOptions.scala | 4 +--- .../apache/spark/sql/execution/datasources/csv/CSVParser.scala | 2 -- .../scala/org/apache/spark/sql/streaming/DataStreamReader.scala | 4 ++-- 12 files changed, 13 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/25a020be/dev/deps/spark-deps-hadoop-2.2 -- diff --git a/dev/deps/spark-deps-hadoop-2.2 b/dev/deps/spark-deps-hadoop-2.2 index a7259e2..f4f92c6 100644 --- a/dev/deps/spark-deps-hadoop-2.2 +++ b/dev/deps/spark-deps-hadoop-2.2 @@ -159,7 +159,7 @@ stax-api-1.0.1.jar stream-2.7.0.jar stringtemplate-3.2.1.jar super-csv-2.2.0.jar -univocity-parsers-2.1.1.jar +univocity-parsers-2.2.1.jar validation-api-1.1.0.Final.jar xbean-asm5-shaded-4.4.jar xmlenc-0.52.jar http://git-wip-us.apache.org/repos/asf/spark/blob/25a020be/dev/deps/spark-deps-hadoop-2.3 -- diff --git a/dev/deps/spark-deps-hadoop-2.3 b/dev/deps/spark-deps-hadoop-2.3 index 6986ab5..3db013f 100644 --- a/dev/deps/spark-deps-hadoop-2.3 +++ b/dev/deps/spark-deps-hadoop-2.3 @@ -167,7 
+167,7 @@ stax-api-1.0.1.jar stream-2.7.0.jar stringtemplate-3.2.1.jar super-csv-2.2.0.jar -univocity-parsers-2.1.1.jar +univocity-parsers-2.2.1.jar validation-api-1.1.0.Final.jar xbean-asm5-shaded-4.4.jar xmlenc-0.52.jar http://git-wip-us.apache.org/repos/asf/spark/blob/25a020be/dev/deps/spark-deps-hadoop-2.4 -- diff --git a/dev/deps/spark-deps-hadoop-2.4 b/dev/deps/spark-deps-hadoop-2.4 index 75cccb3..7171010 100644 --- a/dev/deps/spark-deps-hadoop-2.4 +++ b/dev/deps/spark-deps-hadoop-2.4 @@ -167,7 +167,7 @@ stax-api-1.0.1.jar stream-2.7.0.jar stringtemplate-3.2.1.jar super-csv-2.2.0.jar -univocity-parsers-2.1.1.jar +univocity-parsers-2.2.1.jar validation-api-1.1.0.Final.jar xbean-asm5-shaded-4.4.jar xmlenc-0.52.jar http
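A read sketch with a hypothetical path, assuming a `SparkSession` named `spark`: with the new default the option only needs to be set when an explicit per-column cap is wanted.

```
// maxCharsPerColumn now defaults to -1 (auto-expanding buffer), so it only needs to be
// set when an explicit per-column limit is desired.
val df = spark.read
  .option("header", "true")
  .option("maxCharsPerColumn", "100000")   // optional cap; omit it to rely on the new default
  .csv("/path/to/data.csv")                // hypothetical path
```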
spark git commit: [CORE][MINOR] Add minor code change to TaskState and Task
Repository: spark Updated Branches: refs/heads/master 25a020be9 -> dd7561d33 [CORE][MINOR] Add minor code change to TaskState and Task ## What changes were proposed in this pull request? - TaskState and ExecutorState expose isFailed and isFinished functions. It can be useful to add test coverage for different states. Currently, Other enums do not expose any functions so this PR aims just these two enums. - `private` access modifier is added for Finished Task States Set - A minor doc change is added. ## How was this patch tested? New Unit tests are added and run locally. Author: erenavsarogullari Closes #15143 from erenavsarogullari/SPARK-17584. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dd7561d3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dd7561d3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dd7561d3 Branch: refs/heads/master Commit: dd7561d33761d119ded09cfba072147292bf6964 Parents: 25a020b Author: erenavsarogullari Authored: Wed Sep 21 14:47:18 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 14:47:18 2016 +0100 -- core/src/main/scala/org/apache/spark/TaskState.scala | 2 +- core/src/main/scala/org/apache/spark/scheduler/Task.scala | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dd7561d3/core/src/main/scala/org/apache/spark/TaskState.scala -- diff --git a/core/src/main/scala/org/apache/spark/TaskState.scala b/core/src/main/scala/org/apache/spark/TaskState.scala index cbace7b..596ce67 100644 --- a/core/src/main/scala/org/apache/spark/TaskState.scala +++ b/core/src/main/scala/org/apache/spark/TaskState.scala @@ -21,7 +21,7 @@ private[spark] object TaskState extends Enumeration { val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value - val FINISHED_STATES = Set(FINISHED, FAILED, KILLED, LOST) + private val FINISHED_STATES = Set(FINISHED, FAILED, KILLED, LOST) type TaskState = Value http://git-wip-us.apache.org/repos/asf/spark/blob/dd7561d3/core/src/main/scala/org/apache/spark/scheduler/Task.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/Task.scala b/core/src/main/scala/org/apache/spark/scheduler/Task.scala index 1ed36bf..ea9dc39 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/Task.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/Task.scala @@ -239,7 +239,7 @@ private[spark] object Task { * and return the task itself as a serialized ByteBuffer. The caller can then update its * ClassLoaders and deserialize the task. * - * @return (taskFiles, taskJars, taskBytes) + * @return (taskFiles, taskJars, taskProps, taskBytes) */ def deserializeWithDependencies(serializedTask: ByteBuffer) : (HashMap[String, Long], HashMap[String, Long], Properties, ByteBuffer) = { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
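For context, the `isFailed`/`isFinished` helpers mentioned above are not visible in the diff hunk; after this change the enumeration looks roughly like this:

```
package org.apache.spark

private[spark] object TaskState extends Enumeration {

  val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value

  // Terminal states; private after this change, so callers go through isFinished.
  private val FINISHED_STATES = Set(FINISHED, FAILED, KILLED, LOST)

  type TaskState = Value

  def isFailed(state: TaskState): Boolean = (LOST == state) || (FAILED == state)

  def isFinished(state: TaskState): Boolean = FINISHED_STATES.contains(state)
}
```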
spark git commit: [SPARK-17590][SQL] Analyze CTE definitions at once and allow CTE subquery to define CTE
Repository: spark Updated Branches: refs/heads/master dd7561d33 -> 248922fd4 [SPARK-17590][SQL] Analyze CTE definitions at once and allow CTE subquery to define CTE ## What changes were proposed in this pull request? We substitute logical plan with CTE definitions in the analyzer rule CTESubstitution. A CTE definition can be used in the logical plan for multiple times, and its analyzed logical plan should be the same. We should not analyze CTE definitions multiple times when they are reused in the query. By analyzing CTE definitions before substitution, we can support defining CTE in subquery. ## How was this patch tested? Jenkins tests. Author: Liang-Chi Hsieh Closes #15146 from viirya/cte-analysis-once. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/248922fd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/248922fd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/248922fd Branch: refs/heads/master Commit: 248922fd4fb7c11a40304431e8cc667a8911a906 Parents: dd7561d Author: Liang-Chi Hsieh Authored: Wed Sep 21 06:53:42 2016 -0700 Committer: Herman van Hovell Committed: Wed Sep 21 06:53:42 2016 -0700 -- .../apache/spark/sql/catalyst/parser/SqlBase.g4 | 2 +- .../spark/sql/catalyst/analysis/Analyzer.scala | 5 ++-- .../spark/sql/catalyst/parser/AstBuilder.scala | 2 +- .../org/apache/spark/sql/SubquerySuite.scala| 25 4 files changed, 29 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/248922fd/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 -- diff --git a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 index 7023c0c..de2f9ee 100644 --- a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 +++ b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 @@ -262,7 +262,7 @@ ctes ; namedQuery -: name=identifier AS? '(' queryNoWith ')' +: name=identifier AS? '(' query ')' ; tableProvider http://git-wip-us.apache.org/repos/asf/spark/blob/248922fd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index cc62d5e..ae8869f 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -116,15 +116,14 @@ class Analyzer( ) /** - * Substitute child plan with cte definitions + * Analyze cte definitions and substitute child plan with analyzed cte definitions. 
*/ object CTESubstitution extends Rule[LogicalPlan] { -// TODO allow subquery to define CTE def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { case With(child, relations) => substituteCTE(child, relations.foldLeft(Seq.empty[(String, LogicalPlan)]) { case (resolved, (name, relation)) => -resolved :+ name -> ResolveRelations(substituteCTE(relation, resolved)) +resolved :+ name -> execute(substituteCTE(relation, resolved)) }) case other => other } http://git-wip-us.apache.org/repos/asf/spark/blob/248922fd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala index 69d68fa..12a70b7 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala @@ -108,7 +108,7 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with Logging { * This is only used for Common Table Expressions. */ override def visitNamedQuery(ctx: NamedQueryContext): SubqueryAlias = withOrigin(ctx) { -SubqueryAlias(ctx.name.getText, plan(ctx.queryNoWith), None) +SubqueryAlias(ctx.name.getText, plan(ctx.query), None) } /** http://git-wip-us.apache.org/repos/asf/spark/blob/248922fd/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySui
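Because `namedQuery` now accepts a full `query` rather than `queryNoWith`, a CTE definition may itself contain a `WITH` clause. A sketch, assuming a `SparkSession` named `spark` and a table `t` with an integer column `a`:

```
// A CTE whose definition itself declares a CTE, which was not supported before this change.
spark.sql("""
  WITH outer_cte AS (
    WITH inner_cte AS (SELECT a FROM t)
    SELECT a + 1 AS b FROM inner_cte
  )
  SELECT b FROM outer_cte
""").show()
```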
spark git commit: [BACKPORT 2.0][MINOR][BUILD] Fix CheckStyle Error
Repository: spark Updated Branches: refs/heads/branch-2.0 65295bad9 -> 45bccdd9c [BACKPORT 2.0][MINOR][BUILD] Fix CheckStyle Error ## What changes were proposed in this pull request? This PR is to fix the code style errors. ## How was this patch tested? Manual. Before: ``` ./dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[153] (sizes) LineLength: Line is longer than 100 characters (found 107). [ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[196] (sizes) LineLength: Line is longer than 100 characters (found 108). [ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[239] (sizes) LineLength: Line is longer than 100 characters (found 115). [ERROR] src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[119] (sizes) LineLength: Line is longer than 100 characters (found 107). [ERROR] src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[129] (sizes) LineLength: Line is longer than 100 characters (found 104). [ERROR] src/main/java/org/apache/spark/network/util/LevelDBProvider.java:[124,11] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions. [ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[184] (regexp) RegexpSingleline: No trailing whitespace allowed. [ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[304] (regexp) RegexpSingleline: No trailing whitespace allowed. ``` After: ``` ./dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks passed. ``` Author: Weiqing Yang Closes #15175 from Sherry302/javastylefix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/45bccdd9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/45bccdd9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/45bccdd9 Branch: refs/heads/branch-2.0 Commit: 45bccdd9c2b180323958db0f92ca8ee591e502ef Parents: 65295ba Author: Weiqing Yang Authored: Wed Sep 21 15:18:02 2016 +0100 Committer: Sean Owen Committed: Wed Sep 21 15:18:02 2016 +0100 -- .../org/apache/spark/network/client/TransportClient.java | 11 ++- .../spark/network/server/TransportRequestHandler.java| 7 --- .../org/apache/spark/network/util/LevelDBProvider.java | 2 +- .../apache/spark/network/yarn/YarnShuffleService.java| 4 ++-- 4 files changed, 13 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/45bccdd9/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java -- diff --git a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java index a67683b..17ac91d 100644 --- a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java +++ b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java @@ -150,8 +150,8 @@ public class TransportClient implements Closeable { if (future.isSuccess()) { long timeTaken = System.currentTimeMillis() - startTime; if (logger.isTraceEnabled()) { - logger.trace("Sending request {} to {} took {} ms", streamChunkId, getRemoteAddress(channel), -timeTaken); + logger.trace("Sending request {} to {} took {} ms", streamChunkId, +getRemoteAddress(channel), timeTaken); } } else { String errorMsg = 
String.format("Failed to send request %s to %s: %s", streamChunkId, @@ -193,8 +193,8 @@ public class TransportClient implements Closeable { if (future.isSuccess()) { long timeTaken = System.currentTimeMillis() - startTime; if (logger.isTraceEnabled()) { -logger.trace("Sending request for {} to {} took {} ms", streamId, getRemoteAddress(channel), - timeTaken); +logger.trace("Sending request for {} to {} took {} ms", streamId, + getRemoteAddress(channel), timeTaken); } } else { String errorMsg = String.format("Failed to send request for %s to %s: %s", streamId, @@ -236,7 +236,8 @@ public class TransportClient implements Closeable { if (future.isSuccess()) { long timeTaken = System.currentTimeMillis() - startTime; if (logger.isTraceEnabled()) { - logger.trace("Sending request {} to {} took {} ms", requestId, getRemoteAddress(channel
spark git commit: [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published
Repository: spark Updated Branches: refs/heads/master 248922fd4 -> d7ee12211 [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published This patch updates the `kinesis-asl-assembly` build to prevent that module from being published as part of Maven releases and snapshot builds. The `kinesis-asl-assembly` includes classes from the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), both of which are licensed under the Amazon Software License and are therefore prohibited from being distributed in Apache releases. Author: Josh Rosen Closes #15167 from JoshRosen/stop-publishing-kinesis-assembly. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d7ee1221 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d7ee1221 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d7ee1221 Branch: refs/heads/master Commit: d7ee12211a99efae6f7395e47089236838461d61 Parents: 248922f Author: Josh Rosen Authored: Wed Sep 21 11:38:10 2016 -0700 Committer: Josh Rosen Committed: Wed Sep 21 11:38:10 2016 -0700 -- external/kinesis-asl-assembly/pom.xml | 15 +++ 1 file changed, 15 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d7ee1221/external/kinesis-asl-assembly/pom.xml -- diff --git a/external/kinesis-asl-assembly/pom.xml b/external/kinesis-asl-assembly/pom.xml index df528b3..f7cb764 100644 --- a/external/kinesis-asl-assembly/pom.xml +++ b/external/kinesis-asl-assembly/pom.xml @@ -141,6 +141,21 @@ target/scala-${scala.binary.version}/classes target/scala-${scala.binary.version}/test-classes + + + org.apache.maven.plugins + maven-deploy-plugin + +true + + + + org.apache.maven.plugins + maven-install-plugin + +true + + org.apache.maven.plugins maven-shade-plugin - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published
Repository: spark Updated Branches: refs/heads/branch-2.0 45bccdd9c -> cd0bd89d7 [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published This patch updates the `kinesis-asl-assembly` build to prevent that module from being published as part of Maven releases and snapshot builds. The `kinesis-asl-assembly` includes classes from the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), both of which are licensed under the Amazon Software License and are therefore prohibited from being distributed in Apache releases. Author: Josh Rosen Closes #15167 from JoshRosen/stop-publishing-kinesis-assembly. (cherry picked from commit d7ee12211a99efae6f7395e47089236838461d61) Signed-off-by: Josh Rosen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cd0bd89d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cd0bd89d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cd0bd89d Branch: refs/heads/branch-2.0 Commit: cd0bd89d7852bab5adfee4b1b53c87efbf95176a Parents: 45bccdd Author: Josh Rosen Authored: Wed Sep 21 11:38:10 2016 -0700 Committer: Josh Rosen Committed: Wed Sep 21 11:38:55 2016 -0700 -- external/kinesis-asl-assembly/pom.xml | 15 +++ 1 file changed, 15 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cd0bd89d/external/kinesis-asl-assembly/pom.xml -- diff --git a/external/kinesis-asl-assembly/pom.xml b/external/kinesis-asl-assembly/pom.xml index 58c57c1..8fc6fd9 100644 --- a/external/kinesis-asl-assembly/pom.xml +++ b/external/kinesis-asl-assembly/pom.xml @@ -142,6 +142,21 @@ target/scala-${scala.binary.version}/classes target/scala-${scala.binary.version}/test-classes + + + org.apache.maven.plugins + maven-deploy-plugin + +true + + + + org.apache.maven.plugins + maven-install-plugin + +true + + org.apache.maven.plugins maven-shade-plugin - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-11918][ML] Better error from WLS for cases like singular input
Repository: spark Updated Branches: refs/heads/master d7ee12211 -> b4a4421b6 [SPARK-11918][ML] Better error from WLS for cases like singular input ## What changes were proposed in this pull request? Update error handling for Cholesky decomposition to provide a little more info when input is singular. ## How was this patch tested? New test case; jenkins tests. Author: Sean Owen Closes #15177 from srowen/SPARK-11918. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b4a4421b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b4a4421b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b4a4421b Branch: refs/heads/master Commit: b4a4421b610e776e5280fd5e7453f937f806cbd1 Parents: d7ee122 Author: Sean Owen Authored: Wed Sep 21 18:56:16 2016 + Committer: DB Tsai Committed: Wed Sep 21 18:56:16 2016 + -- .../mllib/linalg/CholeskyDecomposition.scala| 19 +++ .../ml/optim/WeightedLeastSquaresSuite.scala| 20 2 files changed, 35 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b4a4421b/mllib/src/main/scala/org/apache/spark/mllib/linalg/CholeskyDecomposition.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/CholeskyDecomposition.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/CholeskyDecomposition.scala index e449479..08f8f19 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/CholeskyDecomposition.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/CholeskyDecomposition.scala @@ -36,8 +36,7 @@ private[spark] object CholeskyDecomposition { val k = bx.length val info = new intW(0) lapack.dppsv("U", k, 1, A, bx, k, info) -val code = info.`val` -assert(code == 0, s"lapack.dppsv returned $code.") +checkReturnValue(info, "dppsv") bx } @@ -52,8 +51,20 @@ private[spark] object CholeskyDecomposition { def inverse(UAi: Array[Double], k: Int): Array[Double] = { val info = new intW(0) lapack.dpptri("U", k, UAi, info) -val code = info.`val` -assert(code == 0, s"lapack.dpptri returned $code.") +checkReturnValue(info, "dpptri") UAi } + + private def checkReturnValue(info: intW, method: String): Unit = { +info.`val` match { + case code if code < 0 => +throw new IllegalStateException(s"LAPACK.$method returned $code; arg ${-code} is illegal") + case code if code > 0 => +throw new IllegalArgumentException( + s"LAPACK.$method returned $code because A is not positive definite. Is A derived from " + + "a singular matrix (e.g. 
collinear column values)?") + case _ => // do nothing +} + } + } http://git-wip-us.apache.org/repos/asf/spark/blob/b4a4421b/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala index c8de796..2cb1af0 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala @@ -60,6 +60,26 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext ), 2) } + test("two collinear features result in error with no regularization") { +val singularInstances = sc.parallelize(Seq( + Instance(1.0, 1.0, Vectors.dense(1.0, 2.0)), + Instance(2.0, 1.0, Vectors.dense(2.0, 4.0)), + Instance(3.0, 1.0, Vectors.dense(3.0, 6.0)), + Instance(4.0, 1.0, Vectors.dense(4.0, 8.0)) +), 2) + +intercept[IllegalArgumentException] { + new WeightedLeastSquares( +false, regParam = 0.0, standardizeFeatures = false, +standardizeLabel = false).fit(singularInstances) +} + +// Should not throw an exception +new WeightedLeastSquares( + false, regParam = 1.0, standardizeFeatures = false, + standardizeLabel = false).fit(singularInstances) + } + test("WLS against lm") { /* R code: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
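The change above replaces bare `assert`s on the LAPACK return code with descriptive exceptions. A minimal, self-contained Scala sketch of the same mapping, using a plain `Int` in place of netlib-java's `intW` wrapper (that substitution is the only difference from the patched helper):

```scala
// Sketch of the return-code handling added to CholeskyDecomposition:
// info < 0 -> the call itself was malformed (programming error),
// info > 0 -> the matrix is not positive definite, e.g. it came from
//            singular/collinear input, so the caller gets an actionable message.
def checkLapackReturnValue(info: Int, method: String): Unit = info match {
  case code if code < 0 =>
    throw new IllegalStateException(s"LAPACK.$method returned $code; arg ${-code} is illegal")
  case code if code > 0 =>
    throw new IllegalArgumentException(
      s"LAPACK.$method returned $code because A is not positive definite. " +
        "Is A derived from a singular matrix (e.g. collinear column values)?")
  case _ => // 0 means success: nothing to do
}
```

As the new test exercises, an unregularized fit on collinear features now fails fast with `IllegalArgumentException`, while adding regularization (`regParam = 1.0`) keeps the system positive definite and succeeds.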
spark git commit: [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published
Repository: spark Updated Branches: refs/heads/branch-1.6 8f88412c3 -> ce0a222f5 [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published This patch updates the `kinesis-asl-assembly` build to prevent that module from being published as part of Maven releases and snapshot builds. The `kinesis-asl-assembly` includes classes from the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), both of which are licensed under the Amazon Software License and are therefore prohibited from being distributed in Apache releases. Author: Josh Rosen Closes #15167 from JoshRosen/stop-publishing-kinesis-assembly. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ce0a222f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ce0a222f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ce0a222f Branch: refs/heads/branch-1.6 Commit: ce0a222f56ffaf85273d2935b3e6d02aa9f6fa48 Parents: 8f88412 Author: Josh Rosen Authored: Wed Sep 21 11:38:10 2016 -0700 Committer: Josh Rosen Committed: Wed Sep 21 11:42:48 2016 -0700 -- extras/kinesis-asl-assembly/pom.xml | 15 +++ 1 file changed, 15 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ce0a222f/extras/kinesis-asl-assembly/pom.xml -- diff --git a/extras/kinesis-asl-assembly/pom.xml b/extras/kinesis-asl-assembly/pom.xml index 98d6d8d..6528e4e 100644 --- a/extras/kinesis-asl-assembly/pom.xml +++ b/extras/kinesis-asl-assembly/pom.xml @@ -132,6 +132,21 @@ target/scala-${scala.binary.version}/classes target/scala-${scala.binary.version}/test-classes + + + org.apache.maven.plugins + maven-deploy-plugin + +true + + + + org.apache.maven.plugins + maven-install-plugin + +true + + org.apache.maven.plugins maven-shade-plugin - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-4563][CORE] Allow driver to advertise a different network address.
Repository: spark Updated Branches: refs/heads/master b4a4421b6 -> 2cd1bfa4f [SPARK-4563][CORE] Allow driver to advertise a different network address. The goal of this feature is to allow the Spark driver to run in an isolated environment, such as a docker container, and be able to use the host's port forwarding mechanism to be able to accept connections from the outside world. The change is restricted to the driver: there is no support for achieving the same thing on executors (or the YARN AM for that matter). Those still need full access to the outside world so that, for example, connections can be made to an executor's block manager. The core of the change is simple: add a new configuration that tells what's the address the driver should bind to, which can be different than the address it advertises to executors (spark.driver.host). Everything else is plumbing the new configuration where it's needed. To use the feature, the host starting the container needs to set up the driver's port range to fall into a range that is being forwarded; this required the block manager port to need a special configuration just for the driver, which falls back to the existing spark.blockManager.port when not set. This way, users can modify the driver settings without affecting the executors; it would theoretically be nice to also have different retry counts for driver and executors, but given that docker (at least) allows forwarding port ranges, we can probably live without that for now. Because of the nature of the feature it's kinda hard to add unit tests; I just added a simple one to make sure the configuration works. This was tested with a docker image running spark-shell with the following command: docker blah blah blah \ -p 38000-38100:38000-38100 \ [image] \ spark-shell \ --num-executors 3 \ --conf spark.shuffle.service.enabled=false \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.driver.host=[host's address] \ --conf spark.driver.port=38000 \ --conf spark.driver.blockManager.port=38020 \ --conf spark.ui.port=38040 Running on YARN; verified the driver works, executors start up and listen on ephemeral ports (instead of using the driver's config), and that caching and shuffling (without the shuffle service) works. Clicked through the UI to make sure all pages (including executor thread dumps) worked. Also tested apps without docker, and ran unit tests. Author: Marcelo Vanzin Closes #15120 from vanzin/SPARK-4563. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2cd1bfa4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2cd1bfa4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2cd1bfa4 Branch: refs/heads/master Commit: 2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42 Parents: b4a4421 Author: Marcelo Vanzin Authored: Wed Sep 21 14:42:41 2016 -0700 Committer: Shixiong Zhu Committed: Wed Sep 21 14:42:41 2016 -0700 -- .../main/scala/org/apache/spark/SparkConf.scala | 2 ++ .../scala/org/apache/spark/SparkContext.scala | 5 ++-- .../main/scala/org/apache/spark/SparkEnv.scala | 27 +++- .../spark/internal/config/ConfigProvider.scala | 2 +- .../apache/spark/internal/config/package.scala | 20 +++ .../netty/NettyBlockTransferService.scala | 7 ++--- .../scala/org/apache/spark/rpc/RpcEnv.scala | 17 ++-- .../apache/spark/rpc/netty/NettyRpcEnv.scala| 9 --- .../main/scala/org/apache/spark/ui/WebUI.scala | 5 ++-- .../scala/org/apache/spark/util/Utils.scala | 6 ++--- .../netty/NettyBlockTransferSecuritySuite.scala | 6 +++-- .../netty/NettyBlockTransferServiceSuite.scala | 5 ++-- .../spark/rpc/netty/NettyRpcEnvSuite.scala | 16 ++-- .../storage/BlockManagerReplicationSuite.scala | 2 +- .../spark/storage/BlockManagerSuite.scala | 4 +-- docs/configuration.md | 23 - .../cluster/mesos/MesosSchedulerUtils.scala | 3 ++- ...esosCoarseGrainedSchedulerBackendSuite.scala | 5 ++-- .../mesos/MesosSchedulerUtilsSuite.scala| 3 ++- .../spark/streaming/CheckpointSuite.scala | 4 ++- .../streaming/ReceivedBlockHandlerSuite.scala | 2 +- 21 files changed, 133 insertions(+), 40 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2cd1bfa4/core/src/main/scala/org/apache/spark/SparkConf.scala -- diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala b/core/src/main/scala/org/apache/spark/SparkConf.scala index e85e5aa..51a699f4 100644 --- a/core/src/main/scala/org/apache/spark/SparkConf.scala +++ b/core/src/main/scala/org/apache/spark/SparkConf.scala @@ -422,6 +422,8 @@ class SparkConf(loadDefaults: Bool
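A hedged sketch of how the docker scenario above maps onto driver configuration. The port and host keys are taken verbatim from the commit message; the name of the new bind-address setting (`spark.driver.bindAddress` below) is an assumption for illustration, since the message only describes it as "the address the driver should bind to".

```scala
import org.apache.spark.SparkConf

// Driver running inside a container whose ports 38000-38100 are forwarded by the host.
// spark.driver.host is what executors connect back to (the host's address); the bind
// address (assumed key: spark.driver.bindAddress) is what the driver actually listens
// on inside the container.
val conf = new SparkConf()
  .setAppName("driver-behind-port-forwarding")
  .set("spark.driver.host", "host.example.com")    // hypothetical host address
  .set("spark.driver.bindAddress", "0.0.0.0")      // assumption: the new config added by this patch
  .set("spark.driver.port", "38000")
  .set("spark.driver.blockManager.port", "38020")  // driver-only; falls back to spark.blockManager.port
  .set("spark.ui.port", "38040")
```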
spark git commit: [SPARK-17623][CORE] Clarify type of TaskEndReason with a failed task.
Repository: spark Updated Branches: refs/heads/master 2cd1bfa4f -> 9fcf1c51d [SPARK-17623][CORE] Clarify type of TaskEndReason with a failed task. ## What changes were proposed in this pull request? In TaskResultGetter, enqueueFailedTask currently deserializes the result as a TaskEndReason. But the type is actually more specific, its a TaskFailedReason. This just leads to more blind casting later on – it would be more clear if the msg was cast to the right type immediately, so method parameter types could be tightened. ## How was this patch tested? Existing unit tests via jenkins. Note that the code was already performing a blind-cast to a TaskFailedReason before in any case, just in a different spot, so there shouldn't be any behavior change. Author: Imran Rashid Closes #15181 from squito/SPARK-17623. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9fcf1c51 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9fcf1c51 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9fcf1c51 Branch: refs/heads/master Commit: 9fcf1c51d518847eda7f5ea71337cfa7def3c45c Parents: 2cd1bfa Author: Imran Rashid Authored: Wed Sep 21 17:49:36 2016 -0400 Committer: Andrew Or Committed: Wed Sep 21 17:49:36 2016 -0400 -- .../apache/spark/executor/CommitDeniedException.scala | 4 ++-- .../main/scala/org/apache/spark/executor/Executor.scala | 4 ++-- .../org/apache/spark/scheduler/TaskResultGetter.scala | 4 ++-- .../org/apache/spark/scheduler/TaskSchedulerImpl.scala | 2 +- .../org/apache/spark/scheduler/TaskSetManager.scala | 12 +++- .../org/apache/spark/shuffle/FetchFailedException.scala | 4 ++-- .../scala/org/apache/spark/util/JsonProtocolSuite.scala | 2 +- 7 files changed, 13 insertions(+), 19 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9fcf1c51/core/src/main/scala/org/apache/spark/executor/CommitDeniedException.scala -- diff --git a/core/src/main/scala/org/apache/spark/executor/CommitDeniedException.scala b/core/src/main/scala/org/apache/spark/executor/CommitDeniedException.scala index 7d84889..326e042 100644 --- a/core/src/main/scala/org/apache/spark/executor/CommitDeniedException.scala +++ b/core/src/main/scala/org/apache/spark/executor/CommitDeniedException.scala @@ -17,7 +17,7 @@ package org.apache.spark.executor -import org.apache.spark.{TaskCommitDenied, TaskEndReason} +import org.apache.spark.{TaskCommitDenied, TaskFailedReason} /** * Exception thrown when a task attempts to commit output to HDFS but is denied by the driver. 
@@ -29,5 +29,5 @@ private[spark] class CommitDeniedException( attemptNumber: Int) extends Exception(msg) { - def toTaskEndReason: TaskEndReason = TaskCommitDenied(jobID, splitID, attemptNumber) + def toTaskFailedReason: TaskFailedReason = TaskCommitDenied(jobID, splitID, attemptNumber) } http://git-wip-us.apache.org/repos/asf/spark/blob/9fcf1c51/core/src/main/scala/org/apache/spark/executor/Executor.scala -- diff --git a/core/src/main/scala/org/apache/spark/executor/Executor.scala b/core/src/main/scala/org/apache/spark/executor/Executor.scala index fbf2b86..668ec41 100644 --- a/core/src/main/scala/org/apache/spark/executor/Executor.scala +++ b/core/src/main/scala/org/apache/spark/executor/Executor.scala @@ -355,7 +355,7 @@ private[spark] class Executor( } catch { case ffe: FetchFailedException => - val reason = ffe.toTaskEndReason + val reason = ffe.toTaskFailedReason setTaskFinishedAndClearInterruptStatus() execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason)) @@ -370,7 +370,7 @@ private[spark] class Executor( execBackend.statusUpdate(taskId, TaskState.KILLED, ser.serialize(TaskKilled)) case CausedBy(cDE: CommitDeniedException) => - val reason = cDE.toTaskEndReason + val reason = cDE.toTaskFailedReason setTaskFinishedAndClearInterruptStatus() execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason)) http://git-wip-us.apache.org/repos/asf/spark/blob/9fcf1c51/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala b/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala index 685ef55..1c3fcbd 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala @@ -118,14 +118,14 @@ private[spark] class TaskResultGetter(sparkEnv
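A toy Scala model of the type-tightening described above. The names mirror Spark's hierarchy, but the definitions here are simplified stand-ins: once the failure path declares `TaskFailedReason` instead of the broad `TaskEndReason`, the blind `asInstanceOf` disappears and parameter types can be tightened downstream.

```scala
// Simplified stand-ins for Spark's reason hierarchy, only to illustrate the typing change.
sealed trait TaskEndReason
case object Success extends TaskEndReason
sealed trait TaskFailedReason extends TaskEndReason { def toErrorString: String }
final case class ExceptionFailure(description: String) extends TaskFailedReason {
  def toErrorString: String = description
}

// Before: the deserialized value was typed as the supertype, forcing a blind cast later on.
def describeBefore(reason: TaskEndReason): String =
  reason.asInstanceOf[TaskFailedReason].toErrorString

// After: the failure-handling path takes the specific type, so no cast is needed and a
// successful TaskEndReason can never reach it by mistake.
def describeAfter(reason: TaskFailedReason): String = reason.toErrorString
```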
spark git commit: [SPARK-17512][CORE] Avoid formatting to python path for yarn and mesos cluster mode
Repository: spark Updated Branches: refs/heads/master 9fcf1c51d -> 8c3ee2bc4 [SPARK-17512][CORE] Avoid formatting to python path for yarn and mesos cluster mode ## What changes were proposed in this pull request? Yarn and mesos cluster mode support remote python path (HDFS/S3 scheme) by their own mechanism, it is not necessary to check and format the python when running on these modes. This is a potential regression compared to 1.6, so here propose to fix it. ## How was this patch tested? Unit test to verify SparkSubmit arguments, also with local cluster verification. Because of lack of `MiniDFSCluster` support in Spark unit test, there's no integration test added. Author: jerryshao Closes #15137 from jerryshao/SPARK-17512. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8c3ee2bc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8c3ee2bc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8c3ee2bc Branch: refs/heads/master Commit: 8c3ee2bc42e6320b9341cebdba51a00162c897ea Parents: 9fcf1c5 Author: jerryshao Authored: Wed Sep 21 17:57:21 2016 -0400 Committer: Andrew Or Committed: Wed Sep 21 17:57:21 2016 -0400 -- .../org/apache/spark/deploy/SparkSubmit.scala| 13 ++--- .../apache/spark/deploy/SparkSubmitSuite.scala | 19 +++ 2 files changed, 29 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8c3ee2bc/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala index 7b6d5a3..8061165 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala @@ -311,7 +311,7 @@ object SparkSubmit { // In Mesos cluster mode, non-local python files are automatically downloaded by Mesos. if (args.isPython && !isYarnCluster && !isMesosCluster) { if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) { -printErrorAndExit(s"Only local python files are supported: $args.primaryResource") +printErrorAndExit(s"Only local python files are supported: ${args.primaryResource}") } val nonLocalPyFiles = Utils.nonLocalPaths(args.pyFiles).mkString(",") if (nonLocalPyFiles.nonEmpty) { @@ -322,7 +322,7 @@ object SparkSubmit { // Require all R files to be local if (args.isR && !isYarnCluster) { if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) { -printErrorAndExit(s"Only local R files are supported: $args.primaryResource") +printErrorAndExit(s"Only local R files are supported: ${args.primaryResource}") } } @@ -633,7 +633,14 @@ object SparkSubmit { // explicitly sets `spark.submit.pyFiles` in his/her default properties file. sysProps.get("spark.submit.pyFiles").foreach { pyFiles => val resolvedPyFiles = Utils.resolveURIs(pyFiles) - val formattedPyFiles = PythonRunner.formatPaths(resolvedPyFiles).mkString(",") + val formattedPyFiles = if (!isYarnCluster && !isMesosCluster) { +PythonRunner.formatPaths(resolvedPyFiles).mkString(",") + } else { +// Ignoring formatting python path in yarn and mesos cluster mode, these two modes +// support dealing with remote python files, they could distribute and add python files +// locally. 
+resolvedPyFiles + } sysProps("spark.submit.pyFiles") = formattedPyFiles } http://git-wip-us.apache.org/repos/asf/spark/blob/8c3ee2bc/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala b/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala index 961ece3..31c8fb2 100644 --- a/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala @@ -582,6 +582,25 @@ class SparkSubmitSuite val sysProps3 = SparkSubmit.prepareSubmitEnvironment(appArgs3)._3 sysProps3("spark.submit.pyFiles") should be( PythonRunner.formatPaths(Utils.resolveURIs(pyFiles)).mkString(",")) + +// Test remote python files +val f4 = File.createTempFile("test-submit-remote-python-files", "", tmpDir) +val writer4 = new PrintWriter(f4) +val remotePyFiles = "hdfs:///tmp/file1.py,hdfs:///tmp/file2.py" +writer4.println("spark.submit.pyFiles " + remotePyFiles) +writer4.close() +val clArgs4 = Seq( + "--master", "yarn", + "--deploy-mode", "cluster", + "--properties-
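Besides the yarn/mesos change, the diff also fixes two error messages: with `$args.primaryResource` the interpolator stops at the identifier `args`, so the message printed the object's `toString` followed by a literal `.primaryResource`. A small self-contained illustration, where `Args` is a stand-in for `SparkSubmitArguments`:

```scala
// Stand-in for SparkSubmitArguments, only to show the interpolation pitfall.
final case class Args(primaryResource: String)
val args = Args("s3a://bucket/app.py")

// $args interpolates only `args`; ".primaryResource" stays literal text.
val wrong = s"Only local python files are supported: $args.primaryResource"
// "Only local python files are supported: Args(s3a://bucket/app.py).primaryResource"

// ${...} interpolates the full expression.
val right = s"Only local python files are supported: ${args.primaryResource}"
// "Only local python files are supported: s3a://bucket/app.py"
```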
spark git commit: [SPARK-17512][CORE] Avoid formatting to python path for yarn and mesos cluster mode
Repository: spark Updated Branches: refs/heads/branch-2.0 cd0bd89d7 -> 59e6ab11a [SPARK-17512][CORE] Avoid formatting to python path for yarn and mesos cluster mode ## What changes were proposed in this pull request? Yarn and mesos cluster mode support remote python path (HDFS/S3 scheme) by their own mechanism, it is not necessary to check and format the python when running on these modes. This is a potential regression compared to 1.6, so here propose to fix it. ## How was this patch tested? Unit test to verify SparkSubmit arguments, also with local cluster verification. Because of lack of `MiniDFSCluster` support in Spark unit test, there's no integration test added. Author: jerryshao Closes #15137 from jerryshao/SPARK-17512. (cherry picked from commit 8c3ee2bc42e6320b9341cebdba51a00162c897ea) Signed-off-by: Andrew Or Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/59e6ab11 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/59e6ab11 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/59e6ab11 Branch: refs/heads/branch-2.0 Commit: 59e6ab11a9e27d30ae3477fdc03337ff5f8ab4ec Parents: cd0bd89 Author: jerryshao Authored: Wed Sep 21 17:57:21 2016 -0400 Committer: Andrew Or Committed: Wed Sep 21 17:57:33 2016 -0400 -- .../org/apache/spark/deploy/SparkSubmit.scala| 13 ++--- .../apache/spark/deploy/SparkSubmitSuite.scala | 19 +++ 2 files changed, 29 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/59e6ab11/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala index 7b6d5a3..8061165 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala @@ -311,7 +311,7 @@ object SparkSubmit { // In Mesos cluster mode, non-local python files are automatically downloaded by Mesos. if (args.isPython && !isYarnCluster && !isMesosCluster) { if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) { -printErrorAndExit(s"Only local python files are supported: $args.primaryResource") +printErrorAndExit(s"Only local python files are supported: ${args.primaryResource}") } val nonLocalPyFiles = Utils.nonLocalPaths(args.pyFiles).mkString(",") if (nonLocalPyFiles.nonEmpty) { @@ -322,7 +322,7 @@ object SparkSubmit { // Require all R files to be local if (args.isR && !isYarnCluster) { if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) { -printErrorAndExit(s"Only local R files are supported: $args.primaryResource") +printErrorAndExit(s"Only local R files are supported: ${args.primaryResource}") } } @@ -633,7 +633,14 @@ object SparkSubmit { // explicitly sets `spark.submit.pyFiles` in his/her default properties file. sysProps.get("spark.submit.pyFiles").foreach { pyFiles => val resolvedPyFiles = Utils.resolveURIs(pyFiles) - val formattedPyFiles = PythonRunner.formatPaths(resolvedPyFiles).mkString(",") + val formattedPyFiles = if (!isYarnCluster && !isMesosCluster) { +PythonRunner.formatPaths(resolvedPyFiles).mkString(",") + } else { +// Ignoring formatting python path in yarn and mesos cluster mode, these two modes +// support dealing with remote python files, they could distribute and add python files +// locally. 
+resolvedPyFiles + } sysProps("spark.submit.pyFiles") = formattedPyFiles } http://git-wip-us.apache.org/repos/asf/spark/blob/59e6ab11/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala b/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala index b2bc886..54693c1 100644 --- a/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala @@ -577,6 +577,25 @@ class SparkSubmitSuite val sysProps3 = SparkSubmit.prepareSubmitEnvironment(appArgs3)._3 sysProps3("spark.submit.pyFiles") should be( PythonRunner.formatPaths(Utils.resolveURIs(pyFiles)).mkString(",")) + +// Test remote python files +val f4 = File.createTempFile("test-submit-remote-python-files", "", tmpDir) +val writer4 = new PrintWriter(f4) +val remotePyFiles = "hdfs:///tmp/file1.py,hdfs:///tmp/file2.py" +writer4.println("spark.submit.pyFiles " + remotePyFiles) +writer4.close() +
spark git commit: [SPARK-17569] Make StructuredStreaming FileStreamSource batch generation faster
Repository: spark Updated Branches: refs/heads/master 8c3ee2bc4 -> 7cbe21644 [SPARK-17569] Make StructuredStreaming FileStreamSource batch generation faster ## What changes were proposed in this pull request? While getting the batch for a `FileStreamSource` in StructuredStreaming, we know which files we must take specifically. We already have verified that they exist, and have committed them to a metadata log. When creating the FileSourceRelation however for an incremental execution, the code checks the existence of every single file once again! When you have 100,000s of files in a folder, creating the first batch takes 2 hours+ when working with S3! This PR disables that check ## How was this patch tested? Added a unit test to `FileStreamSource`. Author: Burak Yavuz Closes #15122 from brkyvz/SPARK-17569. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7cbe2164 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7cbe2164 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7cbe2164 Branch: refs/heads/master Commit: 7cbe2164499e83b6c009fdbab0fbfffe89a2ecc0 Parents: 8c3ee2b Author: Burak Yavuz Authored: Wed Sep 21 17:12:52 2016 -0700 Committer: Shixiong Zhu Committed: Wed Sep 21 17:12:52 2016 -0700 -- .../sql/execution/datasources/DataSource.scala | 10 +++- .../execution/streaming/FileStreamSource.scala | 3 +- .../streaming/FileStreamSourceSuite.scala | 53 +++- 3 files changed, 62 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7cbe2164/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala index 93154bd..413976a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala @@ -316,8 +316,14 @@ case class DataSource( /** * Create a resolved [[BaseRelation]] that can be used to read data from or write data into this * [[DataSource]] + * + * @param checkFilesExist Whether to confirm that the files exist when generating the + *non-streaming file based datasource. StructuredStreaming jobs already + *list file existence, and when generating incremental jobs, the batch + *is considered as a non-streaming file based data source. Since we know + *that files already exist, we don't need to check them again. */ - def resolveRelation(): BaseRelation = { + def resolveRelation(checkFilesExist: Boolean = true): BaseRelation = { val caseInsensitiveOptions = new CaseInsensitiveMap(options) val relation = (providingClass.newInstance(), userSpecifiedSchema) match { // TODO: Throw when too much is given. 
@@ -368,7 +374,7 @@ case class DataSource( throw new AnalysisException(s"Path does not exist: $qualified") } // Sufficient to check head of the globPath seq for non-glob scenario - if (!fs.exists(globPath.head)) { + if (checkFilesExist && !fs.exists(globPath.head)) { throw new AnalysisException(s"Path does not exist: ${globPath.head}") } globPath http://git-wip-us.apache.org/repos/asf/spark/blob/7cbe2164/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala index 0dc08b1..5ebc083 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala @@ -133,7 +133,8 @@ class FileStreamSource( userSpecifiedSchema = Some(schema), className = fileFormatClassName, options = sourceOptions.optionMapWithoutPath) -Dataset.ofRows(sparkSession, LogicalRelation(newDataSource.resolveRelation())) +Dataset.ofRows(sparkSession, LogicalRelation(newDataSource.resolveRelation( + checkFilesExist = false))) } /** http://git-wip-us.apache.org/repos/asf/spark/blob/7cbe2164/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceSuite.scala -- diff --git a
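A simplified sketch of the path-resolution guard that the new `checkFilesExist` flag controls (names abbreviated; the real logic sits in `DataSource.resolveRelation`, and `sys.error` stands in for `AnalysisException`). The streaming source passes `false` because every file in the batch was already verified when it was committed to the metadata log, so re-checking 100,000s of objects against S3 only adds latency.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve already-globbed paths, optionally verifying that the head exists.
// checkFilesExist = false skips the extra file-system round trip for callers
// (like FileStreamSource) that have already confirmed the files exist.
def resolveGlobbedPaths(
    globPaths: Seq[Path],
    hadoopConf: Configuration,
    checkFilesExist: Boolean): Seq[Path] = {
  if (globPaths.isEmpty) {
    sys.error("Path does not exist")               // stand-in for AnalysisException
  }
  if (checkFilesExist) {
    val fs: FileSystem = globPaths.head.getFileSystem(hadoopConf)
    // Sufficient to check the head of the glob result for the non-glob case.
    if (!fs.exists(globPaths.head)) {
      sys.error(s"Path does not exist: ${globPaths.head}")
    }
  }
  globPaths
}
```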
spark git commit: [SPARK-17577][SPARKR][CORE] SparkR support add files to Spark job and get by executors
Repository: spark Updated Branches: refs/heads/master 7cbe21644 -> c133907c5 [SPARK-17577][SPARKR][CORE] SparkR support add files to Spark job and get by executors ## What changes were proposed in this pull request? Scala/Python users can add files to Spark job by submit options ```--files``` or ```SparkContext.addFile()```. Meanwhile, users can get the added file by ```SparkFiles.get(filename)```. We should also support this function for SparkR users, since they also have the requirements for some shared dependency files. For example, SparkR users can download third party R packages to driver firstly, add these files to the Spark job as dependency by this API and then each executor can install these packages by ```install.packages```. ## How was this patch tested? Add unit test. Author: Yanbo Liang Closes #15131 from yanboliang/spark-17577. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c133907c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c133907c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c133907c Branch: refs/heads/master Commit: c133907c5d9a6e6411b896b5e0cff48b2beff09f Parents: 7cbe216 Author: Yanbo Liang Authored: Wed Sep 21 20:08:28 2016 -0700 Committer: Yanbo Liang Committed: Wed Sep 21 20:08:28 2016 -0700 -- R/pkg/NAMESPACE | 3 ++ R/pkg/R/context.R | 48 R/pkg/inst/tests/testthat/test_context.R| 13 ++ .../scala/org/apache/spark/SparkContext.scala | 6 +-- 4 files changed, 67 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c133907c/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index a5e9cbd..267a38c 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -336,6 +336,9 @@ export("as.DataFrame", "read.parquet", "read.text", "spark.lapply", + "spark.addFile", + "spark.getSparkFilesRootDirectory", + "spark.getSparkFiles", "sql", "str", "tableToDF", http://git-wip-us.apache.org/repos/asf/spark/blob/c133907c/R/pkg/R/context.R -- diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index 13ade49..4793578 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -225,6 +225,54 @@ setCheckpointDir <- function(sc, dirName) { invisible(callJMethod(sc, "setCheckpointDir", suppressWarnings(normalizePath(dirName } +#' Add a file or directory to be downloaded with this Spark job on every node. +#' +#' The path passed can be either a local file, a file in HDFS (or other Hadoop-supported +#' filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, +#' use spark.getSparkFiles(fileName) to find its download location. +#' +#' @rdname spark.addFile +#' @param path The path of the file to be added +#' @export +#' @examples +#'\dontrun{ +#' spark.addFile("~/myfile") +#'} +#' @note spark.addFile since 2.1.0 +spark.addFile <- function(path) { + sc <- getSparkContext() + invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path +} + +#' Get the root directory that contains files added through spark.addFile. +#' +#' @rdname spark.getSparkFilesRootDirectory +#' @return the root directory that contains files added through spark.addFile +#' @export +#' @examples +#'\dontrun{ +#' spark.getSparkFilesRootDirectory() +#'} +#' @note spark.getSparkFilesRootDirectory since 2.1.0 +spark.getSparkFilesRootDirectory <- function() { + callJStatic("org.apache.spark.SparkFiles", "getRootDirectory") +} + +#' Get the absolute path of a file added through spark.addFile. 
+#' +#' @rdname spark.getSparkFiles +#' @param fileName The name of the file added through spark.addFile +#' @return the absolute path of a file added through spark.addFile. +#' @export +#' @examples +#'\dontrun{ +#' spark.getSparkFiles("myfile") +#'} +#' @note spark.getSparkFiles since 2.1.0 +spark.getSparkFiles <- function(fileName) { + callJStatic("org.apache.spark.SparkFiles", "get", as.character(fileName)) +} + #' Run a function over a list of elements, distributing the computations with Spark #' #' Run a function over a list of elements, distributing the computations with Spark. Applies a http://git-wip-us.apache.org/repos/asf/spark/blob/c133907c/R/pkg/inst/tests/testthat/test_context.R -- diff --git a/R/pkg/inst/tests/testthat/test_context.R b/R/pkg/inst/tests/testthat/test_context.R index 1ab7f31..0495418 100644 --- a/R/pkg/inst/tests/testthat/test_context.R +++ b/R/pkg/inst/tests/testthat/test_context.R @@ -166,3 +166,16 @@ test_that("spark
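For reference, the JVM-side API that the new R wrappers delegate to: `spark.addFile`, `spark.getSparkFiles`, and `spark.getSparkFilesRootDirectory` call `SparkContext.addFile` and the `SparkFiles` helpers through `callJMethod`/`callJStatic`. A minimal Scala usage sketch; the file path is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(new SparkConf().setAppName("add-file-demo").setMaster("local[2]"))

// Ship a dependency file to every node with the job.
sc.addFile("/tmp/lookup.csv")                      // hypothetical local path

// On executors, resolve the downloaded copy by file name.
val paths = sc.parallelize(1 to 2).map(_ => SparkFiles.get("lookup.csv")).collect()
paths.foreach(println)

// Root directory that holds files added through addFile
// (what spark.getSparkFilesRootDirectory exposes to R).
println(SparkFiles.getRootDirectory())

sc.stop()
```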
spark git commit: [SPARK-17315][FOLLOW-UP][SPARKR][ML] Fix print of Kolmogorov-Smirnov test summary
Repository: spark Updated Branches: refs/heads/master c133907c5 -> 6902edab7 [SPARK-17315][FOLLOW-UP][SPARKR][ML] Fix print of Kolmogorov-Smirnov test summary ## What changes were proposed in this pull request? #14881 added Kolmogorov-Smirnov Test wrapper to SparkR. I found that ```print.summary.KSTest``` was implemented inappropriately and result in no effect. Running the following code for KSTest: ```Scala data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25, -1, -0.5)) df <- createDataFrame(data) testResult <- spark.kstest(df, "test", "norm") summary(testResult) ``` Before this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/18615016/b9a2823a-7d4f-11e6-934b-128beade355e.png) After this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/18615014/aafe2798-7d4f-11e6-8b99-c705bb9fe8f2.png) The new implementation is similar with [```print.summary.GeneralizedLinearRegressionModel```](https://github.com/apache/spark/blob/master/R/pkg/R/mllib.R#L284) of SparkR and [```print.summary.glm```](https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R) of native R. BTW, I removed the comparison of ```print.summary.KSTest``` in unit test, since it's only wrappers of the summary output which has been checked. Another reason is that these comparison will output summary information to the test console, it will make the test output in a mess. ## How was this patch tested? Existing test. Author: Yanbo Liang Closes #15139 from yanboliang/spark-17315. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6902edab Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6902edab Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6902edab Branch: refs/heads/master Commit: 6902edab7e80e96e3f57cf80f26cefb209d4d63c Parents: c133907 Author: Yanbo Liang Authored: Wed Sep 21 20:14:18 2016 -0700 Committer: Yanbo Liang Committed: Wed Sep 21 20:14:18 2016 -0700 -- R/pkg/R/mllib.R| 16 +--- R/pkg/inst/tests/testthat/test_mllib.R | 16 ++-- 2 files changed, 11 insertions(+), 21 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6902edab/R/pkg/R/mllib.R -- diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R index 234b208..98db367 100644 --- a/R/pkg/R/mllib.R +++ b/R/pkg/R/mllib.R @@ -1398,20 +1398,22 @@ setMethod("summary", signature(object = "KSTest"), distParams <- unlist(callJMethod(jobj, "distParams")) degreesOfFreedom <- callJMethod(jobj, "degreesOfFreedom") -list(p.value = pValue, statistic = statistic, nullHypothesis = nullHypothesis, - nullHypothesis.name = distName, nullHypothesis.parameters = distParams, - degreesOfFreedom = degreesOfFreedom) +ans <- list(p.value = pValue, statistic = statistic, nullHypothesis = nullHypothesis, +nullHypothesis.name = distName, nullHypothesis.parameters = distParams, +degreesOfFreedom = degreesOfFreedom, jobj = jobj) +class(ans) <- "summary.KSTest" +ans }) # Prints the summary of KSTest #' @rdname spark.kstest -#' @param x test result object of KSTest by \code{spark.kstest}. +#' @param x summary object of KSTest returned by \code{summary}. #' @export #' @note print.summary.KSTest since 2.1.0 print.summary.KSTest <- function(x, ...) 
{ - jobj <- x@jobj + jobj <- x$jobj summaryStr <- callJMethod(jobj, "summary") - cat(summaryStr) - invisible(summaryStr) + cat(summaryStr, "\n") + invisible(x) } http://git-wip-us.apache.org/repos/asf/spark/blob/6902edab/R/pkg/inst/tests/testthat/test_mllib.R -- diff --git a/R/pkg/inst/tests/testthat/test_mllib.R b/R/pkg/inst/tests/testthat/test_mllib.R index 5b1404c..24c40a8 100644 --- a/R/pkg/inst/tests/testthat/test_mllib.R +++ b/R/pkg/inst/tests/testthat/test_mllib.R @@ -760,13 +760,7 @@ test_that("spark.kstest", { expect_equal(stats$p.value, rStats$p.value, tolerance = 1e-4) expect_equal(stats$statistic, unname(rStats$statistic), tolerance = 1e-4) - - printStr <- print.summary.KSTest(testResult) - expect_match(printStr, paste0("Kolmogorov-Smirnov test summary:\\n", -"degrees of freedom = 0 \\n", -"statistic = 0.38208[0-9]* \\n", -"pValue = 0.19849[0-9]* \\n", -".*"), perl = TRUE) + expect_match(capture.output(stats)[1], "Kolmogorov-Smirnov test summary:") testResult <- spark.kstest(df, "test", "norm", -0.5) stats <- summary(testResult) @@ -775,13 +769,7 @@ test_that("spark.kstes
spark git commit: [SPARK-17627] Mark Streaming Providers Experimental
Repository: spark Updated Branches: refs/heads/branch-2.0 59e6ab11a -> 966abd6af [SPARK-17627] Mark Streaming Providers Experimental All of structured streaming is experimental in its first release. We missed the annotation on two of the APIs. Author: Michael Armbrust Closes #15188 from marmbrus/experimentalApi. (cherry picked from commit 3497ebe511fee67e66387e9e737c843a2939ce45) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/966abd6a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/966abd6a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/966abd6a Branch: refs/heads/branch-2.0 Commit: 966abd6af04b8e7b5f6446cba96f1825ca2bfcfa Parents: 59e6ab1 Author: Michael Armbrust Authored: Wed Sep 21 20:59:46 2016 -0700 Committer: Reynold Xin Committed: Wed Sep 21 20:59:52 2016 -0700 -- .../src/main/scala/org/apache/spark/sql/sources/interfaces.scala | 4 1 file changed, 4 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/966abd6a/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala b/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala index d2077a0..b84953d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala @@ -112,8 +112,10 @@ trait SchemaRelationProvider { } /** + * ::Experimental:: * Implemented by objects that can produce a streaming [[Source]] for a specific format or system. */ +@Experimental trait StreamSourceProvider { /** Returns the name and schema of the source that can be used to continually read data. */ @@ -132,8 +134,10 @@ trait StreamSourceProvider { } /** + * ::Experimental:: * Implemented by objects that can produce a streaming [[Sink]] for a specific format or system. */ +@Experimental trait StreamSinkProvider { def createSink( sqlContext: SQLContext, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-17627] Mark Streaming Providers Experimental
Repository: spark Updated Branches: refs/heads/master 6902edab7 -> 3497ebe51 [SPARK-17627] Mark Streaming Providers Experimental All of structured streaming is experimental in its first release. We missed the annotation on two of the APIs. Author: Michael Armbrust Closes #15188 from marmbrus/experimentalApi. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3497ebe5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3497ebe5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3497ebe5 Branch: refs/heads/master Commit: 3497ebe511fee67e66387e9e737c843a2939ce45 Parents: 6902eda Author: Michael Armbrust Authored: Wed Sep 21 20:59:46 2016 -0700 Committer: Reynold Xin Committed: Wed Sep 21 20:59:46 2016 -0700 -- .../src/main/scala/org/apache/spark/sql/sources/interfaces.scala | 4 1 file changed, 4 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3497ebe5/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala b/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala index a16d7ed..6484c78 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala @@ -112,8 +112,10 @@ trait SchemaRelationProvider { } /** + * ::Experimental:: * Implemented by objects that can produce a streaming [[Source]] for a specific format or system. */ +@Experimental trait StreamSourceProvider { /** Returns the name and schema of the source that can be used to continually read data. */ @@ -132,8 +134,10 @@ trait StreamSourceProvider { } /** + * ::Experimental:: * Implemented by objects that can produce a streaming [[Sink]] for a specific format or system. */ +@Experimental trait StreamSinkProvider { def createSink( sqlContext: SQLContext, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-17494][SQL] changePrecision() on compact decimal should respect rounding mode
Repository: spark Updated Branches: refs/heads/master 3497ebe51 -> 8bde03bf9 [SPARK-17494][SQL] changePrecision() on compact decimal should respect rounding mode ## What changes were proposed in this pull request? Floor()/Ceil() of decimal is implemented using changePrecision() by passing a rounding mode, but the rounding mode is not respected when the decimal is in compact mode (could fit within a Long). This Update the changePrecision() to respect rounding mode, which could be ROUND_FLOOR, ROUND_CEIL, ROUND_HALF_UP, ROUND_HALF_EVEN. ## How was this patch tested? Added regression tests. Author: Davies Liu Closes #15154 from davies/decimal_round. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8bde03bf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8bde03bf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8bde03bf Branch: refs/heads/master Commit: 8bde03bf9a0896ea59ceaa699df7700351a130fb Parents: 3497ebe Author: Davies Liu Authored: Wed Sep 21 21:02:30 2016 -0700 Committer: Reynold Xin Committed: Wed Sep 21 21:02:30 2016 -0700 -- .../org/apache/spark/sql/types/Decimal.scala| 28 +--- .../apache/spark/sql/types/DecimalSuite.scala | 15 +++ 2 files changed, 39 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8bde03bf/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala index cc8175c..7085905 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala @@ -242,10 +242,30 @@ final class Decimal extends Ordered[Decimal] with Serializable { if (scale < _scale) { // Easier case: we just need to divide our scale down val diff = _scale - scale -val droppedDigits = longVal % POW_10(diff) -longVal /= POW_10(diff) -if (math.abs(droppedDigits) * 2 >= POW_10(diff)) { - longVal += (if (longVal < 0) -1L else 1L) +val pow10diff = POW_10(diff) +// % and / always round to 0 +val droppedDigits = longVal % pow10diff +longVal /= pow10diff +roundMode match { + case ROUND_FLOOR => +if (droppedDigits < 0) { + longVal += -1L +} + case ROUND_CEILING => +if (droppedDigits > 0) { + longVal += 1L +} + case ROUND_HALF_UP => +if (math.abs(droppedDigits) * 2 >= pow10diff) { + longVal += (if (droppedDigits < 0) -1L else 1L) +} + case ROUND_HALF_EVEN => +val doubled = math.abs(droppedDigits) * 2 +if (doubled > pow10diff || doubled == pow10diff && longVal % 2 != 0) { + longVal += (if (droppedDigits < 0) -1L else 1L) +} + case _ => +sys.error(s"Not supported rounding mode: $roundMode") } } else if (scale > _scale) { // We might be able to multiply longVal by a power of 10 and not overflow, but if not, http://git-wip-us.apache.org/repos/asf/spark/blob/8bde03bf/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala index a10c0e3..52d0692 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala @@ -20,6 +20,7 @@ package org.apache.spark.sql.types import org.scalatest.PrivateMethodTester import org.apache.spark.SparkFunSuite +import 
org.apache.spark.sql.types.Decimal._ class DecimalSuite extends SparkFunSuite with PrivateMethodTester { /** Check that a Decimal has the given string representation, precision and scale */ @@ -191,4 +192,18 @@ class DecimalSuite extends SparkFunSuite with PrivateMethodTester { assert(new Decimal().set(100L, 10, 0).toUnscaledLong === 100L) assert(Decimal(Long.MaxValue, 100, 0).toUnscaledLong === Long.MaxValue) } + + test("changePrecision() on compact decimal should respect rounding mode") { +Seq(ROUND_FLOOR, ROUND_CEILING, ROUND_HALF_UP, ROUND_HALF_EVEN).foreach { mode => + Seq("0.4", "0.5", "0.6", "1.0", "1.1", "1.6", "2.5", "5.5").foreach { n => +Seq("", "-").foreach { sign => + val bd = BigDecimal(sign + n) + val unscale
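A standalone sketch of the compact-path arithmetic shown in the diff: the unscaled `Long` is divided by `10^diff`, and the sign and magnitude of the dropped digits decide the adjustment for each rounding mode. It uses `scala.math.BigDecimal.RoundingMode` directly in place of the `ROUND_*` constants the patch and its test use from `Decimal`; this is an illustration of the logic, not Spark's public API.

```scala
import scala.math.BigDecimal.RoundingMode

// Reduce the scale of a compact (Long-backed) decimal by `diff` digits, honouring the
// rounding mode — the same dropped-digits logic the patch adds to Decimal.changePrecision.
def reduceScale(unscaled: Long, diff: Int, mode: RoundingMode.Value): Long = {
  require(diff >= 0 && diff <= 18, "sketch only handles powers of ten that fit in a Long")
  val pow10diff = (1 to diff).foldLeft(1L)((p, _) => p * 10L)  // like Decimal's POW_10 table
  val droppedDigits = unscaled % pow10diff                      // % and / both round toward zero
  var longVal = unscaled / pow10diff
  mode match {
    case RoundingMode.FLOOR =>
      if (droppedDigits < 0) longVal -= 1L
    case RoundingMode.CEILING =>
      if (droppedDigits > 0) longVal += 1L
    case RoundingMode.HALF_UP =>
      if (math.abs(droppedDigits) * 2 >= pow10diff) longVal += (if (droppedDigits < 0) -1L else 1L)
    case RoundingMode.HALF_EVEN =>
      val doubled = math.abs(droppedDigits) * 2
      if (doubled > pow10diff || (doubled == pow10diff && longVal % 2 != 0)) {
        longVal += (if (droppedDigits < 0) -1L else 1L)
      }
    case other =>
      sys.error(s"Not supported rounding mode: $other")
  }
  longVal
}

// Example: 2.5 stored compactly as unscaled 25 with scale 1, reduced to scale 0:
// FLOOR -> 2, CEILING -> 3, HALF_UP -> 3, HALF_EVEN -> 2 (ties go to the even neighbour).
```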
spark git commit: [SPARK-17494][SQL] changePrecision() on compact decimal should respect rounding mode
Repository: spark Updated Branches: refs/heads/branch-2.0 966abd6af -> ec377e773 [SPARK-17494][SQL] changePrecision() on compact decimal should respect rounding mode ## What changes were proposed in this pull request? Floor()/Ceil() of decimal is implemented using changePrecision() by passing a rounding mode, but the rounding mode is not respected when the decimal is in compact mode (could fit within a Long). This Update the changePrecision() to respect rounding mode, which could be ROUND_FLOOR, ROUND_CEIL, ROUND_HALF_UP, ROUND_HALF_EVEN. ## How was this patch tested? Added regression tests. Author: Davies Liu Closes #15154 from davies/decimal_round. (cherry picked from commit 8bde03bf9a0896ea59ceaa699df7700351a130fb) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ec377e77 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ec377e77 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ec377e77 Branch: refs/heads/branch-2.0 Commit: ec377e77307b477d20a642edcd5ad5e26b989de6 Parents: 966abd6 Author: Davies Liu Authored: Wed Sep 21 21:02:30 2016 -0700 Committer: Reynold Xin Committed: Wed Sep 21 21:02:42 2016 -0700 -- .../org/apache/spark/sql/types/Decimal.scala| 28 +--- .../apache/spark/sql/types/DecimalSuite.scala | 15 +++ 2 files changed, 39 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ec377e77/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala index cc8175c..7085905 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala @@ -242,10 +242,30 @@ final class Decimal extends Ordered[Decimal] with Serializable { if (scale < _scale) { // Easier case: we just need to divide our scale down val diff = _scale - scale -val droppedDigits = longVal % POW_10(diff) -longVal /= POW_10(diff) -if (math.abs(droppedDigits) * 2 >= POW_10(diff)) { - longVal += (if (longVal < 0) -1L else 1L) +val pow10diff = POW_10(diff) +// % and / always round to 0 +val droppedDigits = longVal % pow10diff +longVal /= pow10diff +roundMode match { + case ROUND_FLOOR => +if (droppedDigits < 0) { + longVal += -1L +} + case ROUND_CEILING => +if (droppedDigits > 0) { + longVal += 1L +} + case ROUND_HALF_UP => +if (math.abs(droppedDigits) * 2 >= pow10diff) { + longVal += (if (droppedDigits < 0) -1L else 1L) +} + case ROUND_HALF_EVEN => +val doubled = math.abs(droppedDigits) * 2 +if (doubled > pow10diff || doubled == pow10diff && longVal % 2 != 0) { + longVal += (if (droppedDigits < 0) -1L else 1L) +} + case _ => +sys.error(s"Not supported rounding mode: $roundMode") } } else if (scale > _scale) { // We might be able to multiply longVal by a power of 10 and not overflow, but if not, http://git-wip-us.apache.org/repos/asf/spark/blob/ec377e77/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala index e1675c9..4cf329d 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DecimalSuite.scala @@ -22,6 +22,7 @@ import scala.language.postfixOps import 
org.scalatest.PrivateMethodTester import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.types.Decimal._ class DecimalSuite extends SparkFunSuite with PrivateMethodTester { /** Check that a Decimal has the given string representation, precision and scale */ @@ -193,4 +194,18 @@ class DecimalSuite extends SparkFunSuite with PrivateMethodTester { assert(new Decimal().set(100L, 10, 0).toUnscaledLong === 100L) assert(Decimal(Long.MaxValue, 100, 0).toUnscaledLong === Long.MaxValue) } + + test("changePrecision() on compact decimal should respect rounding mode") { +Seq(ROUND_FLOOR, ROUND_CEILING, ROUND_HALF_UP, ROUND_HALF_EVEN).foreach { mode => + Seq("0.4", "0.5", "0.6", "1.0", "1.1", "1.6", "2.5", "5.5").foreach { n =>
spark git commit: Bump doc version for release 2.0.1.
Repository: spark Updated Branches: refs/heads/branch-2.0 ec377e773 -> 053b20a79 Bump doc version for release 2.0.1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/053b20a7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/053b20a7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/053b20a7 Branch: refs/heads/branch-2.0 Commit: 053b20a79c1824917c17405f30a7b91472311abe Parents: ec377e7 Author: Reynold Xin Authored: Wed Sep 21 21:06:47 2016 -0700 Committer: Reynold Xin Committed: Wed Sep 21 21:06:47 2016 -0700 -- docs/_config.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/053b20a7/docs/_config.yml -- diff --git a/docs/_config.yml b/docs/_config.yml index 3951cad..75c89bd 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -14,8 +14,8 @@ include: # These allow the documentation to be updated with newer releases # of Spark, Scala, and Mesos. -SPARK_VERSION: 2.0.0 -SPARK_VERSION_SHORT: 2.0.0 +SPARK_VERSION: 2.0.1 +SPARK_VERSION_SHORT: 2.0.1 SCALA_BINARY_VERSION: "2.11" SCALA_VERSION: "2.11.7" MESOS_VERSION: 0.21.0 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[1/2] spark git commit: Preparing Spark release v2.0.1-rc1
Repository: spark Updated Branches: refs/heads/branch-2.0 053b20a79 -> e8b26be9b Preparing Spark release v2.0.1-rc1 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/00f2e28e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/00f2e28e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/00f2e28e Branch: refs/heads/branch-2.0 Commit: 00f2e28edd5a74f75e8b4c58894eeb3a394649d7 Parents: 053b20a Author: Patrick Wendell Authored: Wed Sep 21 21:09:08 2016 -0700 Committer: Patrick Wendell Committed: Wed Sep 21 21:09:08 2016 -0700 -- assembly/pom.xml | 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/java8-tests/pom.xml | 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- repl/pom.xml | 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- yarn/pom.xml | 2 +- 34 files changed, 34 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/00f2e28e/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 507ddc7..6db3a59 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.1 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/00f2e28e/common/network-common/pom.xml -- diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index e170b9b..269b845 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.1 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/00f2e28e/common/network-shuffle/pom.xml -- diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index 8b832cf..20cf29e 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.1 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/00f2e28e/common/network-yarn/pom.xml -- diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index 07d9f1c..25cc328 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.1 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/00f2e28e/common/sketch/pom.xml -- diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index 5e02efd..37a5d09 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.1 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/00f2e28e/common/tags/pom.xml -- diff --git a/common/tags/pom.xml b/common/tags/pom.xml index e7fc6a2..ab287f3 100644 --- 
a/common/tags/pom.xml +++ b/common/tags/pom.xml @@ -22,7
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v2.0.1-rc1 [created] 00f2e28ed
[2/2] spark git commit: Preparing development version 2.0.2-SNAPSHOT
Preparing development version 2.0.2-SNAPSHOT Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e8b26be9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e8b26be9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e8b26be9 Branch: refs/heads/branch-2.0 Commit: e8b26be9bf2b7c46f4076b4d0597ea8451d3 Parents: 00f2e28 Author: Patrick Wendell Authored: Wed Sep 21 21:09:19 2016 -0700 Committer: Patrick Wendell Committed: Wed Sep 21 21:09:19 2016 -0700 -- assembly/pom.xml | 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/java8-tests/pom.xml | 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- repl/pom.xml | 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- yarn/pom.xml | 2 +- 34 files changed, 34 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e8b26be9/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 6db3a59..ca6daa2 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.0.1 +2.0.2-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/e8b26be9/common/network-common/pom.xml -- diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index 269b845..c727f54 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1 +2.0.2-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/e8b26be9/common/network-shuffle/pom.xml -- diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index 20cf29e..e335a89 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1 +2.0.2-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/e8b26be9/common/network-yarn/pom.xml -- diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index 25cc328..8e64f56 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1 +2.0.2-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/e8b26be9/common/sketch/pom.xml -- diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index 37a5d09..94c75d6 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1 +2.0.2-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/e8b26be9/common/tags/pom.xml -- diff --git a/common/tags/pom.xml b/common/tags/pom.xml index ab287f3..6ff14d2 100644 --- a/common/tags/pom.xml +++ b/common/tags/pom.xml @@ -22,7 +22,7 @@ 
org.apache.spark spark-parent_2.11 -2.0.1 +2
spark git commit: [SPARK-17609][SQL] SessionCatalog.tableExists should not check temp view
Repository: spark Updated Branches: refs/heads/master 8bde03bf9 -> b50b34f56 [SPARK-17609][SQL] SessionCatalog.tableExists should not check temp view ## What changes were proposed in this pull request? After #15054 , there is no place in Spark SQL that need `SessionCatalog.tableExists` to check temp views, so this PR makes `SessionCatalog.tableExists` only check permanent table/view and removes some hacks. This PR also improves the `getTempViewOrPermanentTableMetadata` that is introduced in #15054 , to make the code simpler. ## How was this patch tested? existing tests Author: Wenchen Fan Closes #15160 from cloud-fan/exists. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b50b34f5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b50b34f5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b50b34f5 Branch: refs/heads/master Commit: b50b34f5611a1f182ba9b6eaf86c666bbd9f9eb0 Parents: 8bde03b Author: Wenchen Fan Authored: Thu Sep 22 12:52:09 2016 +0800 Committer: Wenchen Fan Committed: Thu Sep 22 12:52:09 2016 +0800 -- .../sql/catalyst/catalog/SessionCatalog.scala | 70 ++-- .../catalyst/catalog/SessionCatalogSuite.scala | 30 + .../org/apache/spark/sql/DataFrameWriter.scala | 9 +-- .../command/createDataSourceTables.scala| 15 ++--- .../spark/sql/execution/command/ddl.scala | 43 +--- .../spark/sql/execution/command/tables.scala| 17 + .../apache/spark/sql/internal/CatalogImpl.scala | 6 +- .../sql/hive/MetastoreDataSourcesSuite.scala| 2 +- .../spark/sql/hive/execution/HiveDDLSuite.scala | 4 +- 9 files changed, 81 insertions(+), 115 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b50b34f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala index ef29c75..8c01c7a 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala @@ -246,6 +246,16 @@ class SessionCatalog( } /** + * Return whether a table/view with the specified name exists. If no database is specified, check + * with current database. + */ + def tableExists(name: TableIdentifier): Boolean = synchronized { +val db = formatDatabaseName(name.database.getOrElse(currentDb)) +val table = formatTableName(name.table) +externalCatalog.tableExists(db, table) + } + + /** * Retrieve the metadata of an existing permanent table/view. If no database is specified, * assume the table/view is in the current database. If the specified table/view is not found * in the database then a [[NoSuchTableException]] is thrown. @@ -271,24 +281,6 @@ class SessionCatalog( } /** - * Retrieve the metadata of an existing temporary view or permanent table/view. - * If the temporary view does not exist, tries to get the metadata an existing permanent - * table/view. If no database is specified, assume the table/view is in the current database. - * If the specified table/view is not found in the database then a [[NoSuchTableException]] is - * thrown. 
- */ - def getTempViewOrPermanentTableMetadata(name: String): CatalogTable = synchronized { -val table = formatTableName(name) -getTempView(table).map { plan => - CatalogTable( -identifier = TableIdentifier(table), -tableType = CatalogTableType.VIEW, -storage = CatalogStorageFormat.empty, -schema = plan.output.toStructType) -}.getOrElse(getTableMetadata(TableIdentifier(name))) - } - - /** * Load files stored in given path into an existing metastore table. * If no database is specified, assume the table is in the current database. * If the specified table is not found in the database then a [[NoSuchTableException]] is thrown. @@ -369,6 +361,30 @@ class SessionCatalog( // - /** + * Retrieve the metadata of an existing temporary view or permanent table/view. + * + * If a database is specified in `name`, this will return the metadata of table/view in that + * database. + * If no database is specified, this will first attempt to get the metadata of a temporary view + * with the same name, then, if that does not exist, return the metadata of table/view in the + * current database. + */ + def getTempViewOrPermanentTableMetadata(name: TableIdentifier): CatalogTable = synchronized {
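A self-contained toy sketch of the two lookup rules this change settles on (invented names, not the SessionCatalog API): `tableExists` consults only the external catalog, while the temp-view-or-table lookup prefers a temporary view only for database-unqualified names.

```scala
import scala.collection.mutable

final case class ToyMeta(name: String, isTempView: Boolean)

final class ToyCatalog(currentDb: String) {
  private val tempViews = mutable.Map.empty[String, ToyMeta]
  private val tables    = mutable.Map.empty[(String, String), ToyMeta]

  def createTempView(name: String): Unit =
    tempViews(name) = ToyMeta(name, isTempView = true)

  def createTable(db: String, name: String): Unit =
    tables((db, name)) = ToyMeta(name, isTempView = false)

  // Analogue of the new tableExists: temp views are deliberately ignored.
  def tableExists(db: Option[String], name: String): Boolean =
    tables.contains((db.getOrElse(currentDb), name))

  // Analogue of getTempViewOrPermanentTableMetadata: a qualified name never hits a temp view.
  def tempViewOrTableMetadata(db: Option[String], name: String): Option[ToyMeta] = db match {
    case Some(d) => tables.get((d, name))
    case None    => tempViews.get(name).orElse(tables.get((currentDb, name)))
  }
}

val cat = new ToyCatalog("default")
cat.createTempView("t")
assert(!cat.tableExists(None, "t"))                                 // a temp view alone does not "exist"
assert(cat.tempViewOrTableMetadata(None, "t").exists(_.isTempView)) // but it still resolves when unqualified
```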
spark git commit: [SPARK-17425][SQL] Override sameResult in HiveTableScanExec to make ReuseExchange work in text format table
Repository: spark Updated Branches: refs/heads/master b50b34f56 -> cb324f611 [SPARK-17425][SQL] Override sameResult in HiveTableScanExec to make ReuseExchange work in text format table ## What changes were proposed in this pull request? The PR will override the `sameResult` in `HiveTableScanExec` to make `ReuseExchange` work in text format table. ## How was this patch tested? # SQL ```sql SELECT * FROM src t1 JOIN src t2 ON t1.key = t2.key JOIN src t3 ON t1.key = t3.key; ``` # Before ``` == Physical Plan == *BroadcastHashJoin [key#30], [key#34], Inner, BuildRight :- *BroadcastHashJoin [key#30], [key#32], Inner, BuildRight : :- *Filter isnotnull(key#30) : : +- HiveTableScan [key#30, value#31], MetastoreRelation default, src : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- *Filter isnotnull(key#32) :+- HiveTableScan [key#32, value#33], MetastoreRelation default, src +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) +- *Filter isnotnull(key#34) +- HiveTableScan [key#34, value#35], MetastoreRelation default, src ``` # After ``` == Physical Plan == *BroadcastHashJoin [key#2], [key#6], Inner, BuildRight :- *BroadcastHashJoin [key#2], [key#4], Inner, BuildRight : :- *Filter isnotnull(key#2) : : +- HiveTableScan [key#2, value#3], MetastoreRelation default, src : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- *Filter isnotnull(key#4) :+- HiveTableScan [key#4, value#5], MetastoreRelation default, src +- ReusedExchange [key#6, value#7], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) ``` cc: davies cloud-fan Author: Yadong Qi Closes #14988 from watermen/SPARK-17425. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cb324f61 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cb324f61 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cb324f61 Branch: refs/heads/master Commit: cb324f61150c962aeabf0a779f6a09797b3d5072 Parents: b50b34f Author: Yadong Qi Authored: Thu Sep 22 13:04:42 2016 +0800 Committer: Wenchen Fan Committed: Thu Sep 22 13:04:42 2016 +0800 -- .../spark/sql/hive/execution/HiveTableScanExec.scala | 15 +++ 1 file changed, 15 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cb324f61/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala index a716a3e..231f204 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala @@ -164,4 +164,19 @@ case class HiveTableScanExec( } override def output: Seq[Attribute] = attributes + + override def sameResult(plan: SparkPlan): Boolean = plan match { +case other: HiveTableScanExec => + val thisPredicates = partitionPruningPred.map(cleanExpression) + val otherPredicates = other.partitionPruningPred.map(cleanExpression) + + val result = relation.sameResult(other.relation) && +output.length == other.output.length && + output.zip(other.output) +.forall(p => p._1.name == p._2.name && p._1.dataType == p._2.dataType) && + thisPredicates.length == otherPredicates.length && +thisPredicates.zip(otherPredicates).forall(p => p._1.semanticEquals(p._2)) + 
result +case _ => false + } }
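A quick way to confirm the effect of the override (a sketch; it assumes a Hive-enabled SparkSession bound to `spark` and the classic `src(key INT, value STRING)` test table):

```scala
import org.apache.spark.sql.execution.exchange.ReusedExchangeExec

val df = spark.sql(
  """SELECT * FROM src t1
    |JOIN src t2 ON t1.key = t2.key
    |JOIN src t3 ON t1.key = t3.key""".stripMargin)

// Collect the reused exchanges in the executed plan; before this change the list came back
// empty for text-format Hive tables because the scans were not recognized as equivalent.
val reused = df.queryExecution.executedPlan.collect { case r: ReusedExchangeExec => r }
assert(reused.nonEmpty)
```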
spark git commit: [SPARK-17492][SQL] Fix Reading Cataloged Data Sources without Extending SchemaRelationProvider
Repository: spark Updated Branches: refs/heads/master cb324f611 -> 3a80f92f8 [SPARK-17492][SQL] Fix Reading Cataloged Data Sources without Extending SchemaRelationProvider ### What changes were proposed in this pull request? For data sources without extending `SchemaRelationProvider`, we expect users to not specify schemas when they creating tables. If the schema is input from users, an exception is issued. Since Spark 2.1, for any data source, to avoid infer the schema every time, we store the schema in the metastore catalog. Thus, when reading a cataloged data source table, the schema could be read from metastore catalog. In this case, we also got an exception. For example, ```Scala sql( s""" |CREATE TABLE relationProvierWithSchema |USING org.apache.spark.sql.sources.SimpleScanSource |OPTIONS ( | From '1', | To '10' |) """.stripMargin) spark.table(tableName).show() ``` ``` org.apache.spark.sql.sources.SimpleScanSource does not allow user-specified schemas.; ``` This PR is to fix the above issue. When building a data source, we introduce a flag `isSchemaFromUsers` to indicate whether the schema is really input from users. If true, we issue an exception. Otherwise, we will call the `createRelation` of `RelationProvider` to generate the `BaseRelation`, in which it contains the actual schema. ### How was this patch tested? Added a few cases. Author: gatorsmile Closes #15046 from gatorsmile/tempViewCases. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3a80f92f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3a80f92f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3a80f92f Branch: refs/heads/master Commit: 3a80f92f8f4b91d0a85724bca7d81c6f5bbb78fd Parents: cb324f6 Author: gatorsmile Authored: Thu Sep 22 13:19:06 2016 +0800 Committer: Wenchen Fan Committed: Thu Sep 22 13:19:06 2016 +0800 -- .../sql/execution/datasources/DataSource.scala | 9 ++- .../apache/spark/sql/sources/InsertSuite.scala | 20 ++ .../spark/sql/sources/TableScanSuite.scala | 64 +--- .../sql/test/DataFrameReaderWriterSuite.scala | 33 ++ 4 files changed, 102 insertions(+), 24 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3a80f92f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala index 413976a..3206701 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala @@ -333,8 +333,13 @@ case class DataSource( dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) case (_: SchemaRelationProvider, None) => throw new AnalysisException(s"A schema needs to be specified when using $className.") - case (_: RelationProvider, Some(_)) => -throw new AnalysisException(s"$className does not allow user-specified schemas.") + case (dataSource: RelationProvider, Some(schema)) => +val baseRelation = + dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) +if (baseRelation.schema != schema) { + throw new AnalysisException(s"$className does not allow user-specified schemas.") +} +baseRelation // We are reading from the results of a streaming query. Load files from the metadata log // instead of listing them using HDFS APIs. 
http://git-wip-us.apache.org/repos/asf/spark/blob/3a80f92f/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala index 6454d71..5eb5464 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala @@ -65,6 +65,26 @@ class InsertSuite extends DataSourceTest with SharedSQLContext { ) } + test("insert into a temp view that does not point to an insertable data source") { +import testImplicits._ +withTempView("t1", "t2") { + sql( +""" + |CREATE TEMPORARY VIEW t1 + |USING org.apache.spark.sql.sources.SimpleScanSource + |OPTIONS ( + | From '1', + | To '10') +""".stripMargin) + sparkContext.parallelize(1 to 10).
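For context, a minimal data source in the spirit of the `SimpleScanSource` used by these tests: it implements only `RelationProvider`, so its schema comes from the relation itself, never from the user. The class name `RangeSource` and the column name are invented for this sketch.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class RangeSource extends RelationProvider {
  override def createRelation(
      ctx: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val lower = parameters("from").toInt
    val upper = parameters("to").toInt
    new BaseRelation with TableScan {
      override val sqlContext: SQLContext = ctx
      // The relation fixes its own schema; that is what ends up stored in the metastore.
      override val schema: StructType = StructType(StructField("i", IntegerType) :: Nil)
      override def buildScan(): RDD[Row] =
        ctx.sparkContext.parallelize(lower to upper).map(Row(_))
    }
  }
}
```

With the fix, a table created with `USING` the fully qualified class name and `OPTIONS (from '1', to '10')` can later be read back via `spark.table(...)`: the schema retrieved from the catalog equals the relation's own schema, so the "does not allow user-specified schemas" error is no longer raised.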
spark git commit: [SPARK-17625][SQL] set expectedOutputAttributes when converting SimpleCatalogRelation to LogicalRelation
Repository: spark Updated Branches: refs/heads/master 3a80f92f8 -> de7df7def [SPARK-17625][SQL] set expectedOutputAttributes when converting SimpleCatalogRelation to LogicalRelation ## What changes were proposed in this pull request? We should set expectedOutputAttributes when converting SimpleCatalogRelation to LogicalRelation, otherwise the outputs of LogicalRelation are different from outputs of SimpleCatalogRelation - they have different exprId's. ## How was this patch tested? add a test case Author: Zhenhua Wang Closes #15182 from wzhfy/expectedAttributes. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/de7df7de Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/de7df7de Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/de7df7de Branch: refs/heads/master Commit: de7df7defc99e04fefd990974151a701f64b75b4 Parents: 3a80f92 Author: Zhenhua Wang Authored: Thu Sep 22 14:48:49 2016 +0800 Committer: Wenchen Fan Committed: Thu Sep 22 14:48:49 2016 +0800 -- .../execution/datasources/DataSourceStrategy.scala| 10 +++--- .../scala/org/apache/spark/sql/DataFrameSuite.scala | 14 +- 2 files changed, 20 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/de7df7de/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index c8ad5b3..63f01c5 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala @@ -197,7 +197,10 @@ case class DataSourceAnalysis(conf: CatalystConf) extends Rule[LogicalPlan] { * source information. 
*/ class FindDataSourceTable(sparkSession: SparkSession) extends Rule[LogicalPlan] { - private def readDataSourceTable(sparkSession: SparkSession, table: CatalogTable): LogicalPlan = { + private def readDataSourceTable( + sparkSession: SparkSession, + simpleCatalogRelation: SimpleCatalogRelation): LogicalPlan = { +val table = simpleCatalogRelation.catalogTable val dataSource = DataSource( sparkSession, @@ -209,16 +212,17 @@ class FindDataSourceTable(sparkSession: SparkSession) extends Rule[LogicalPlan] LogicalRelation( dataSource.resolveRelation(), + expectedOutputAttributes = Some(simpleCatalogRelation.output), catalogTable = Some(table)) } override def apply(plan: LogicalPlan): LogicalPlan = plan transform { case i @ logical.InsertIntoTable(s: SimpleCatalogRelation, _, _, _, _) if DDLUtils.isDatasourceTable(s.metadata) => - i.copy(table = readDataSourceTable(sparkSession, s.metadata)) + i.copy(table = readDataSourceTable(sparkSession, s)) case s: SimpleCatalogRelation if DDLUtils.isDatasourceTable(s.metadata) => - readDataSourceTable(sparkSession, s.metadata) + readDataSourceTable(sparkSession, s) } } http://git-wip-us.apache.org/repos/asf/spark/blob/de7df7de/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index c2d256b..2c60a7d 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -26,7 +26,8 @@ import scala.util.Random import org.scalatest.Matchers._ import org.apache.spark.SparkException -import org.apache.spark.sql.catalyst.plans.logical.{OneRowRelation, Union} +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.plans.logical.{OneRowRelation, Project, Union} import org.apache.spark.sql.execution.QueryExecution import org.apache.spark.sql.execution.aggregate.HashAggregateExec import org.apache.spark.sql.execution.exchange.{BroadcastExchangeExec, ReusedExchangeExec, ShuffleExchange} @@ -1585,4 +1586,15 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { val d = sampleDf.withColumn("c", monotonically_increasing_id).select($"c").collect assert(d.size == d.distinct.size) } + + test("SPARK-17625: data source table in InMemoryCatalog should guarantee output consistency") { +val tableName = "tbl" +withTable(tableName) { + spark.range(10).select('id as 'i, 'id as 'j).write.saveAsTable(tableName) + val relation = spark.s
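To make the exprId point concrete, a toy illustration in plain Scala (not Catalyst types; `Attr` is invented) of why the converted `LogicalRelation` must reuse the `SimpleCatalogRelation` output attributes:

```scala
// An attribute reference is matched by its exprId, not by its name.
final case class Attr(name: String, exprId: Long)

// Suppose a Project was resolved against the original relation's output: i#1, j#2.
val originalOutput = Seq(Attr("i", 1L), Attr("j", 2L))
val projectRefs    = originalOutput

// If the rule swapped in a relation that minted fresh ids (i#10, j#11), the references dangle:
val freshOutput = Seq(Attr("i", 10L), Attr("j", 11L))
projectRefs.forall(ref => freshOutput.exists(_.exprId == ref.exprId))     // false -> plan no longer resolves

// Passing expectedOutputAttributes = Some(originalOutput) keeps the ids stable:
projectRefs.forall(ref => originalOutput.exists(_.exprId == ref.exprId))  // true
```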