spark git commit: [SPARK-15696][SQL] Improve `crosstab` to have a consistent column order
Repository: spark
Updated Branches:
  refs/heads/master 6c5fd977f -> 5a3533e77

[SPARK-15696][SQL] Improve `crosstab` to have a consistent column order

## What changes were proposed in this pull request?

Currently, `crosstab` returns a DataFrame whose columns appear in a **random order**, obtained by just `distinct`. Also, the documentation of `crosstab` shows the result in sorted order, which is different from the current implementation. This PR explicitly constructs the columns in sorted order to improve the user experience. With this change, the implementation gives the same result as the documentation.

**Before**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  3|  2|  1|
+---------+---+---+---+
|        2|  1|  0|  2|
|        1|  0|  1|  1|
|        3|  1|  1|  0|
+---------+---+---+---+

scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  c|  a|  b|
+---------+---+---+---+
|        2|  1|  2|  0|
|        1|  0|  1|  1|
|        3|  1|  0|  1|
+---------+---+---+---+
```

**After**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  1|  2|  3|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+

scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  a|  b|  c|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
```

## How was this patch tested?

Pass the Jenkins tests with updated test cases.

Author: Dongjoon Hyun

Closes #13436 from dongjoon-hyun/SPARK-15696.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5a3533e7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5a3533e7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5a3533e7 Branch: refs/heads/master Commit: 5a3533e779d8e43ce0980203dfd3cbe343cc7d0a Parents: 6c5fd97 Author: Dongjoon Hyun Authored: Thu Jun 9 22:46:51 2016 -0700 Committer: Reynold Xin Committed: Thu Jun 9 22:46:51 2016 -0700 -- .../org/apache/spark/sql/execution/stat/StatFunctions.scala | 4 ++-- .../test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5a3533e7/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala index 9c04061..ea58df7 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala @@ -423,9 +423,9 @@ private[sql] object StatFunctions extends Logging { def cleanElement(element: Any): String = { if (element == null) "null" else element.toString } -// get the distinct values of column 2, so that we can make them the column names +// get the distinct sorted values of column 2, so that we can make them the column names val distinctCol2: Map[Any, Int] = - counts.map(e => cleanElement(e.get(1))).distinct.zipWithIndex.toMap + counts.map(e => cleanElement(e.get(1))).distinct.sorted.zipWithIndex.toMap val columnSize = distinctCol2.size require(columnSize < 1e4, s"The number of distinct values for $col2, can't " + s"exceed 1e4. Currently $columnSize") http://git-wip-us.apache.org/repos/asf/spark/blob/5a3533e7/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java -- diff --git a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java index 1e8f106..0152f3f 100644 --- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java +++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java @@ -246,8 +246,8 @@ public class JavaDataFrameSuite { Dataset crosstab = df.stat().crosstab("a", "b"); String[] columnNames = crosstab.schema().fieldNames(); Assert.asser
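The essence of the patch above is a one-word change: adding `sorted` between `distinct` and `zipWithIndex` makes the pivoted column order deterministic. A minimal sketch of that idea, assuming a plain Scala collection rather than the real `counts` rows collected inside `StatFunctions`:

```scala
// Sketch only: mirrors the one-line change in StatFunctions.crossTabulate.
// `values` stands in for the second-column values of the collected counts.
def columnIndex(values: Seq[Any]): Map[String, Int] = {
  def cleanElement(element: Any): String =
    if (element == null) "null" else element.toString

  // Before: .distinct.zipWithIndex        -> column order depends on input order
  // After:  .distinct.sorted.zipWithIndex -> columns always come out sorted
  values.map(cleanElement).distinct.sorted.zipWithIndex.toMap
}

// columnIndex(Seq("c", "a", "b", "a")) == Map("a" -> 0, "b" -> 1, "c" -> 2)
```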
spark git commit: [SPARK-15696][SQL] Improve `crosstab` to have a consistent column order
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 ebbbf2136 -> 1371d5ece

[SPARK-15696][SQL] Improve `crosstab` to have a consistent column order

## What changes were proposed in this pull request?

Currently, `crosstab` returns a DataFrame whose columns appear in a **random order**, obtained by just `distinct`. Also, the documentation of `crosstab` shows the result in sorted order, which is different from the current implementation. This PR explicitly constructs the columns in sorted order to improve the user experience. With this change, the implementation gives the same result as the documentation.

**Before**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  3|  2|  1|
+---------+---+---+---+
|        2|  1|  0|  2|
|        1|  0|  1|  1|
|        3|  1|  1|  0|
+---------+---+---+---+

scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  c|  a|  b|
+---------+---+---+---+
|        2|  1|  2|  0|
|        1|  0|  1|  1|
|        3|  1|  0|  1|
+---------+---+---+---+
```

**After**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  1|  2|  3|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+

scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  a|  b|  c|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
```

## How was this patch tested?

Pass the Jenkins tests with updated test cases.

Author: Dongjoon Hyun

Closes #13436 from dongjoon-hyun/SPARK-15696.
(cherry picked from commit 5a3533e779d8e43ce0980203dfd3cbe343cc7d0a) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1371d5ec Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1371d5ec Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1371d5ec Branch: refs/heads/branch-2.0 Commit: 1371d5ecedb76d138dc9f431e5b40e36a58ed9ca Parents: ebbbf21 Author: Dongjoon Hyun Authored: Thu Jun 9 22:46:51 2016 -0700 Committer: Reynold Xin Committed: Thu Jun 9 22:46:58 2016 -0700 -- .../org/apache/spark/sql/execution/stat/StatFunctions.scala | 4 ++-- .../test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1371d5ec/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala index 9c04061..ea58df7 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala @@ -423,9 +423,9 @@ private[sql] object StatFunctions extends Logging { def cleanElement(element: Any): String = { if (element == null) "null" else element.toString } -// get the distinct values of column 2, so that we can make them the column names +// get the distinct sorted values of column 2, so that we can make them the column names val distinctCol2: Map[Any, Int] = - counts.map(e => cleanElement(e.get(1))).distinct.zipWithIndex.toMap + counts.map(e => cleanElement(e.get(1))).distinct.sorted.zipWithIndex.toMap val columnSize = distinctCol2.size require(columnSize < 1e4, s"The number of distinct values for $col2, can't " + s"exceed 1e4. Currently $columnSize") http://git-wip-us.apache.org/repos/asf/spark/blob/1371d5ec/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java -- diff --git a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java index 1e8f106..0152f3f 100644 --- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java +++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java @@ -246,8 +246,8 @@ public class JavaDataFrameSuite { Dataset crosstab = d
spark git commit: [SPARK-15791] Fix NPE in ScalarSubquery
Repository: spark Updated Branches: refs/heads/master 16df133d7 -> 6c5fd977f [SPARK-15791] Fix NPE in ScalarSubquery ## What changes were proposed in this pull request? The fix is pretty simple, just don't make the executedPlan transient in `ScalarSubquery` since it is referenced at execution time. ## How was this patch tested? I verified the fix manually in non-local mode. It's not clear to me why the problem did not manifest in local mode, any suggestions? cc davies Author: Eric Liang Closes #13569 from ericl/fix-scalar-npe. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6c5fd977 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6c5fd977 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6c5fd977 Branch: refs/heads/master Commit: 6c5fd977fbcb821a57cb4a13bc3d413a695fbc32 Parents: 16df133 Author: Eric Liang Authored: Thu Jun 9 22:28:31 2016 -0700 Committer: Reynold Xin Committed: Thu Jun 9 22:28:31 2016 -0700 -- .../scala/org/apache/spark/sql/execution/subquery.scala | 2 +- .../src/test/scala/org/apache/spark/sql/QueryTest.scala | 10 -- .../test/scala/org/apache/spark/sql/SQLQuerySuite.scala | 3 ++- .../test/scala/org/apache/spark/sql/SubquerySuite.scala | 4 4 files changed, 15 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6c5fd977/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala index 4a1f12d..461d301 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala @@ -32,7 +32,7 @@ import org.apache.spark.sql.types.DataType * This is the physical copy of ScalarSubquery to be used inside SparkPlan. */ case class ScalarSubquery( -@transient executedPlan: SparkPlan, +executedPlan: SparkPlan, exprId: ExprId) extends SubqueryExpression { http://git-wip-us.apache.org/repos/asf/spark/blob/6c5fd977/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala b/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala index 9c044f4..acb59d4 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala @@ -341,10 +341,16 @@ object QueryTest { * * @param df the [[DataFrame]] to be executed * @param expectedAnswer the expected result in a [[Seq]] of [[Row]]s. + * @param checkToRDD whether to verify deserialization to an RDD. This runs the query twice. 
*/ - def checkAnswer(df: DataFrame, expectedAnswer: Seq[Row]): Option[String] = { + def checkAnswer( + df: DataFrame, + expectedAnswer: Seq[Row], + checkToRDD: Boolean = true): Option[String] = { val isSorted = df.logicalPlan.collect { case s: logical.Sort => s }.nonEmpty - +if (checkToRDD) { + df.rdd.count() // Also attempt to deserialize as an RDD [SPARK-15791] +} val sparkAnswer = try df.collect().toSeq catch { case e: Exception => http://git-wip-us.apache.org/repos/asf/spark/blob/6c5fd977/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala index 8284e8d..90465b6 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala @@ -2118,7 +2118,8 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { // is correct. def verifyCallCount(df: DataFrame, expectedResult: Row, expectedCount: Int): Unit = { countAcc.setValue(0) -checkAnswer(df, expectedResult) +QueryTest.checkAnswer( + df, Seq(expectedResult), checkToRDD = false /* avoid duplicate exec */) assert(countAcc.value == expectedCount) } http://git-wip-us.apache.org/repos/asf/spark/blob/6c5fd977/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala index a932125..05491a4 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala +++
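For context on why the `@transient` annotation caused an NPE only in non-local mode: a `@transient` field is skipped by Java serialization, so once the expression is shipped to an executor and deserialized, the field comes back as `null`. A generic, self-contained illustration of that effect (not Spark code):

```scala
import java.io._

// Generic illustration, not Spark code: a @transient field survives locally but
// is null after a serialize/deserialize round trip, which is exactly what
// happens when a plan fragment is sent to an executor in non-local mode.
case class Holder(@transient plan: String, id: Int)

object TransientDemo {
  def roundTrip[T <: Serializable](value: T): T = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(value)
    oos.close()
    val ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    ois.readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val local = Holder("executedPlan", 1)
    println(local.plan)      // prints "executedPlan"
    val shipped = roundTrip(local)
    println(shipped.plan)    // prints "null"; dereferencing it throws an NPE
  }
}
```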
spark git commit: [SPARK-15791] Fix NPE in ScalarSubquery
Repository: spark Updated Branches: refs/heads/branch-2.0 d45aa50fc -> ebbbf2136 [SPARK-15791] Fix NPE in ScalarSubquery ## What changes were proposed in this pull request? The fix is pretty simple, just don't make the executedPlan transient in `ScalarSubquery` since it is referenced at execution time. ## How was this patch tested? I verified the fix manually in non-local mode. It's not clear to me why the problem did not manifest in local mode, any suggestions? cc davies Author: Eric Liang Closes #13569 from ericl/fix-scalar-npe. (cherry picked from commit 6c5fd977fbcb821a57cb4a13bc3d413a695fbc32) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ebbbf213 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ebbbf213 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ebbbf213 Branch: refs/heads/branch-2.0 Commit: ebbbf2136412c628421b88a2bc83091a2b473c55 Parents: d45aa50 Author: Eric Liang Authored: Thu Jun 9 22:28:31 2016 -0700 Committer: Reynold Xin Committed: Thu Jun 9 22:28:36 2016 -0700 -- .../scala/org/apache/spark/sql/execution/subquery.scala | 2 +- .../src/test/scala/org/apache/spark/sql/QueryTest.scala | 10 -- .../test/scala/org/apache/spark/sql/SQLQuerySuite.scala | 3 ++- .../test/scala/org/apache/spark/sql/SubquerySuite.scala | 4 4 files changed, 15 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ebbbf213/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala index 4a1f12d..461d301 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala @@ -32,7 +32,7 @@ import org.apache.spark.sql.types.DataType * This is the physical copy of ScalarSubquery to be used inside SparkPlan. */ case class ScalarSubquery( -@transient executedPlan: SparkPlan, +executedPlan: SparkPlan, exprId: ExprId) extends SubqueryExpression { http://git-wip-us.apache.org/repos/asf/spark/blob/ebbbf213/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala b/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala index 9c044f4..acb59d4 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala @@ -341,10 +341,16 @@ object QueryTest { * * @param df the [[DataFrame]] to be executed * @param expectedAnswer the expected result in a [[Seq]] of [[Row]]s. + * @param checkToRDD whether to verify deserialization to an RDD. This runs the query twice. 
*/ - def checkAnswer(df: DataFrame, expectedAnswer: Seq[Row]): Option[String] = { + def checkAnswer( + df: DataFrame, + expectedAnswer: Seq[Row], + checkToRDD: Boolean = true): Option[String] = { val isSorted = df.logicalPlan.collect { case s: logical.Sort => s }.nonEmpty - +if (checkToRDD) { + df.rdd.count() // Also attempt to deserialize as an RDD [SPARK-15791] +} val sparkAnswer = try df.collect().toSeq catch { case e: Exception => http://git-wip-us.apache.org/repos/asf/spark/blob/ebbbf213/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala index 8284e8d..90465b6 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala @@ -2118,7 +2118,8 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { // is correct. def verifyCallCount(df: DataFrame, expectedResult: Row, expectedCount: Int): Unit = { countAcc.setValue(0) -checkAnswer(df, expectedResult) +QueryTest.checkAnswer( + df, Seq(expectedResult), checkToRDD = false /* avoid duplicate exec */) assert(countAcc.value == expectedCount) } http://git-wip-us.apache.org/repos/asf/spark/blob/ebbbf213/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scal
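The accompanying test change is also worth noting: `checkAnswer` now materializes `df.rdd` as well, so the RDD deserialization path that triggered the NPE is exercised by every answer check, with an opt-out flag for tests that must not run the query twice. A simplified sketch of that helper shape (the real `QueryTest.checkAnswer` also normalizes rows and builds a detailed error message):

```scala
import org.apache.spark.sql.{DataFrame, Row}

// Simplified sketch of the new checkAnswer signature; not the full QueryTest logic.
def checkAnswerSketch(
    df: DataFrame,
    expectedAnswer: Seq[Row],
    checkToRDD: Boolean = true): Unit = {
  if (checkToRDD) {
    df.rdd.count()  // also exercise deserialization to an RDD [SPARK-15791]
  }
  assert(df.collect().toSeq == expectedAnswer)
}
```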
spark git commit: [SPARK-15850][SQL] Remove function grouping in SparkSession
Repository: spark Updated Branches: refs/heads/branch-2.0 ca0801120 -> d45aa50fc [SPARK-15850][SQL] Remove function grouping in SparkSession ## What changes were proposed in this pull request? SparkSession does not have that many functions due to better namespacing, and as a result we probably don't need the function grouping. This patch removes the grouping and also adds missing scaladocs for createDataset functions in SQLContext. Closes #13577. ## How was this patch tested? N/A - this is a documentation change. Author: Reynold Xin Closes #13582 from rxin/SPARK-15850. (cherry picked from commit 16df133d7f5f3115cd5baa696fa73a4694f9cba9) Signed-off-by: Herman van Hovell Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d45aa50f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d45aa50f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d45aa50f Branch: refs/heads/branch-2.0 Commit: d45aa50fc7e41f2a8a659b2f512a846342a854dd Parents: ca08011 Author: Reynold Xin Authored: Thu Jun 9 18:58:24 2016 -0700 Committer: Herman van Hovell Committed: Thu Jun 9 18:58:37 2016 -0700 -- .../scala/org/apache/spark/sql/SQLContext.scala | 62 +++- .../org/apache/spark/sql/SparkSession.scala | 28 - .../scala/org/apache/spark/sql/functions.scala | 2 +- 3 files changed, 61 insertions(+), 31 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d45aa50f/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala index 0fb2400..23f2b6e 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala @@ -51,7 +51,7 @@ import org.apache.spark.sql.util.ExecutionListenerManager * @groupname specificdata Specific Data Sources * @groupname config Configuration * @groupname dataframes Custom DataFrame Creation - * @groupname dataset Custom DataFrame Creation + * @groupname dataset Custom Dataset Creation * @groupname Ungrouped Support functions for language integrated queries * @since 1.0.0 */ @@ -346,15 +346,73 @@ class SQLContext private[sql](val sparkSession: SparkSession) sparkSession.createDataFrame(rowRDD, schema, needsConversion) } - + /** + * :: Experimental :: + * Creates a [[Dataset]] from a local Seq of data of a given type. This method requires an + * encoder (to convert a JVM object of type `T` to and from the internal Spark SQL representation) + * that is generally created automatically through implicits from a `SparkSession`, or can be + * created explicitly by calling static methods on [[Encoders]]. + * + * == Example == + * + * {{{ + * + * import spark.implicits._ + * case class Person(name: String, age: Long) + * val data = Seq(Person("Michael", 29), Person("Andy", 30), Person("Justin", 19)) + * val ds = spark.createDataset(data) + * + * ds.show() + * // +---+---+ + * // | name|age| + * // +---+---+ + * // |Michael| 29| + * // | Andy| 30| + * // | Justin| 19| + * // +---+---+ + * }}} + * + * @since 2.0.0 + * @group dataset + */ + @Experimental def createDataset[T : Encoder](data: Seq[T]): Dataset[T] = { sparkSession.createDataset(data) } + /** + * :: Experimental :: + * Creates a [[Dataset]] from an RDD of a given type. 
This method requires an + * encoder (to convert a JVM object of type `T` to and from the internal Spark SQL representation) + * that is generally created automatically through implicits from a `SparkSession`, or can be + * created explicitly by calling static methods on [[Encoders]]. + * + * @since 2.0.0 + * @group dataset + */ + @Experimental def createDataset[T : Encoder](data: RDD[T]): Dataset[T] = { sparkSession.createDataset(data) } + /** + * :: Experimental :: + * Creates a [[Dataset]] from a [[java.util.List]] of a given type. This method requires an + * encoder (to convert a JVM object of type `T` to and from the internal Spark SQL representation) + * that is generally created automatically through implicits from a `SparkSession`, or can be + * created explicitly by calling static methods on [[Encoders]]. + * + * == Java Example == + * + * {{{ + * List data = Arrays.asList("hello", "world"); + * Dataset ds = spark.createDataset(data, Encoders.STRING()); + * }}} + * + * @since 2.0.0 + * @group dataset + */ + @Experimental def createDataset[T : Encoder](data: java.util.List[T]): Dataset[T] = { sparkSession.createDa
spark git commit: [SPARK-15850][SQL] Remove function grouping in SparkSession
Repository: spark Updated Branches: refs/heads/master 4d9d9cc58 -> 16df133d7 [SPARK-15850][SQL] Remove function grouping in SparkSession ## What changes were proposed in this pull request? SparkSession does not have that many functions due to better namespacing, and as a result we probably don't need the function grouping. This patch removes the grouping and also adds missing scaladocs for createDataset functions in SQLContext. Closes #13577. ## How was this patch tested? N/A - this is a documentation change. Author: Reynold Xin Closes #13582 from rxin/SPARK-15850. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/16df133d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/16df133d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/16df133d Branch: refs/heads/master Commit: 16df133d7f5f3115cd5baa696fa73a4694f9cba9 Parents: 4d9d9cc Author: Reynold Xin Authored: Thu Jun 9 18:58:24 2016 -0700 Committer: Herman van Hovell Committed: Thu Jun 9 18:58:24 2016 -0700 -- .../scala/org/apache/spark/sql/SQLContext.scala | 62 +++- .../org/apache/spark/sql/SparkSession.scala | 28 - .../scala/org/apache/spark/sql/functions.scala | 2 +- 3 files changed, 61 insertions(+), 31 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/16df133d/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala index 0fb2400..23f2b6e 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala @@ -51,7 +51,7 @@ import org.apache.spark.sql.util.ExecutionListenerManager * @groupname specificdata Specific Data Sources * @groupname config Configuration * @groupname dataframes Custom DataFrame Creation - * @groupname dataset Custom DataFrame Creation + * @groupname dataset Custom Dataset Creation * @groupname Ungrouped Support functions for language integrated queries * @since 1.0.0 */ @@ -346,15 +346,73 @@ class SQLContext private[sql](val sparkSession: SparkSession) sparkSession.createDataFrame(rowRDD, schema, needsConversion) } - + /** + * :: Experimental :: + * Creates a [[Dataset]] from a local Seq of data of a given type. This method requires an + * encoder (to convert a JVM object of type `T` to and from the internal Spark SQL representation) + * that is generally created automatically through implicits from a `SparkSession`, or can be + * created explicitly by calling static methods on [[Encoders]]. + * + * == Example == + * + * {{{ + * + * import spark.implicits._ + * case class Person(name: String, age: Long) + * val data = Seq(Person("Michael", 29), Person("Andy", 30), Person("Justin", 19)) + * val ds = spark.createDataset(data) + * + * ds.show() + * // +---+---+ + * // | name|age| + * // +---+---+ + * // |Michael| 29| + * // | Andy| 30| + * // | Justin| 19| + * // +---+---+ + * }}} + * + * @since 2.0.0 + * @group dataset + */ + @Experimental def createDataset[T : Encoder](data: Seq[T]): Dataset[T] = { sparkSession.createDataset(data) } + /** + * :: Experimental :: + * Creates a [[Dataset]] from an RDD of a given type. This method requires an + * encoder (to convert a JVM object of type `T` to and from the internal Spark SQL representation) + * that is generally created automatically through implicits from a `SparkSession`, or can be + * created explicitly by calling static methods on [[Encoders]]. 
+ * + * @since 2.0.0 + * @group dataset + */ + @Experimental def createDataset[T : Encoder](data: RDD[T]): Dataset[T] = { sparkSession.createDataset(data) } + /** + * :: Experimental :: + * Creates a [[Dataset]] from a [[java.util.List]] of a given type. This method requires an + * encoder (to convert a JVM object of type `T` to and from the internal Spark SQL representation) + * that is generally created automatically through implicits from a `SparkSession`, or can be + * created explicitly by calling static methods on [[Encoders]]. + * + * == Java Example == + * + * {{{ + * List data = Arrays.asList("hello", "world"); + * Dataset ds = spark.createDataset(data, Encoders.STRING()); + * }}} + * + * @since 2.0.0 + * @group dataset + */ + @Experimental def createDataset[T : Encoder](data: java.util.List[T]): Dataset[T] = { sparkSession.createDataset(data) } http://git-wip-us.apache.org/repos/asf/spark/blob/16df133d/sql/core/src/main/scala/org/apache/
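The new scaladoc shows examples for the `Seq` and `java.util.List` overloads; the `RDD` overload has no inline example, so here is a hedged usage sketch (assumes an existing `SparkSession` named `spark`; the case class and data are illustrative):

```scala
// Illustrative usage of createDataset[T : Encoder](data: RDD[T]); assumes `spark`
// is an existing SparkSession.
import spark.implicits._

case class Person(name: String, age: Long)

val peopleRdd = spark.sparkContext.parallelize(
  Seq(Person("Michael", 29), Person("Andy", 30), Person("Justin", 19)))

val ds = spark.createDataset(peopleRdd)
ds.show()
```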
spark git commit: [SPARK-15853][SQL] HDFSMetadataLog.get should close the input stream
Repository: spark Updated Branches: refs/heads/branch-2.0 00bbf7873 -> ca0801120 [SPARK-15853][SQL] HDFSMetadataLog.get should close the input stream ## What changes were proposed in this pull request? This PR closes the input stream created in `HDFSMetadataLog.get` ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu Closes #13583 from zsxwing/leak. (cherry picked from commit 4d9d9cc5853c467acdb67915117127915a98d8f8) Signed-off-by: Tathagata Das Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ca080112 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ca080112 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ca080112 Branch: refs/heads/branch-2.0 Commit: ca0801120b3c650603b98b7838e86fee45f8999f Parents: 00bbf78 Author: Shixiong Zhu Authored: Thu Jun 9 18:45:19 2016 -0700 Committer: Tathagata Das Committed: Thu Jun 9 18:45:39 2016 -0700 -- .../spark/sql/execution/streaming/HDFSMetadataLog.scala | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ca080112/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala index fca3d51..069e41b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala @@ -175,8 +175,12 @@ class HDFSMetadataLog[T: ClassTag](sparkSession: SparkSession, path: String) val batchMetadataFile = batchIdToPath(batchId) if (fileManager.exists(batchMetadataFile)) { val input = fileManager.open(batchMetadataFile) - val bytes = IOUtils.toByteArray(input) - Some(deserialize(bytes)) + try { +val bytes = IOUtils.toByteArray(input) +Some(deserialize(bytes)) + } finally { +input.close() + } } else { logDebug(s"Unable to find batch $batchMetadataFile") None - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15853][SQL] HDFSMetadataLog.get should close the input stream
Repository: spark Updated Branches: refs/heads/master b914e1930 -> 4d9d9cc58 [SPARK-15853][SQL] HDFSMetadataLog.get should close the input stream ## What changes were proposed in this pull request? This PR closes the input stream created in `HDFSMetadataLog.get` ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu Closes #13583 from zsxwing/leak. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4d9d9cc5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4d9d9cc5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4d9d9cc5 Branch: refs/heads/master Commit: 4d9d9cc5853c467acdb67915117127915a98d8f8 Parents: b914e19 Author: Shixiong Zhu Authored: Thu Jun 9 18:45:19 2016 -0700 Committer: Tathagata Das Committed: Thu Jun 9 18:45:19 2016 -0700 -- .../spark/sql/execution/streaming/HDFSMetadataLog.scala | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4d9d9cc5/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala index fca3d51..069e41b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala @@ -175,8 +175,12 @@ class HDFSMetadataLog[T: ClassTag](sparkSession: SparkSession, path: String) val batchMetadataFile = batchIdToPath(batchId) if (fileManager.exists(batchMetadataFile)) { val input = fileManager.open(batchMetadataFile) - val bytes = IOUtils.toByteArray(input) - Some(deserialize(bytes)) + try { +val bytes = IOUtils.toByteArray(input) +Some(deserialize(bytes)) + } finally { +input.close() + } } else { logDebug(s"Unable to find batch $batchMetadataFile") None - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
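The fix itself is the standard acquire/use/close pattern around the metadata input stream. As a general, hedged sketch (not part of Spark's API), the same idea can be factored into a small loan-pattern helper so the stream is closed even when deserialization throws:

```scala
import java.io.Closeable

// Generic loan-pattern helper, illustrative only: the resource is closed whether
// `use` returns normally or throws.
def withResource[R <: Closeable, T](resource: R)(use: R => T): T = {
  try {
    use(resource)
  } finally {
    resource.close()
  }
}

// Usage sketch mirroring HDFSMetadataLog.get:
//   withResource(fileManager.open(batchMetadataFile)) { input =>
//     Some(deserialize(IOUtils.toByteArray(input)))
//   }
```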
spark git commit: [SPARK-15794] Should truncate toString() of very wide plans
Repository: spark Updated Branches: refs/heads/master 83070cd1d -> b914e1930 [SPARK-15794] Should truncate toString() of very wide plans ## What changes were proposed in this pull request? With very wide tables, e.g. thousands of fields, the plan output is unreadable and often causes OOMs due to inefficient string processing. This truncates all struct and operator field lists to a user configurable threshold to limit performance impact. It would also be nice to optimize string generation to avoid these sort of O(n^2) slowdowns entirely (i.e. use StringBuilder everywhere including expressions), but this is probably too large of a change for 2.0 at this point, and truncation has other benefits for usability. ## How was this patch tested? Added a microbenchmark that covers this case particularly well. I also ran the microbenchmark while varying the truncation threshold. ``` numFields = 5 wide shallowly nested struct field r/w: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative 2000 wide x 50 rows (write in-mem)2336 / 2558 0.0 23364.4 0.1X numFields = 25 wide shallowly nested struct field r/w: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative 2000 wide x 50 rows (write in-mem)4237 / 4465 0.0 42367.9 0.1X numFields = 100 wide shallowly nested struct field r/w: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative 2000 wide x 50 rows (write in-mem) 10458 / 11223 0.0 104582.0 0.0X numFields = Infinity wide shallowly nested struct field r/w: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative [info] java.lang.OutOfMemoryError: Java heap space ``` Author: Eric Liang Author: Eric Liang Closes #13537 from ericl/truncated-string. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b914e193 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b914e193 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b914e193 Branch: refs/heads/master Commit: b914e1930fd5c5f2808f92d4958ec6fbeddf2e30 Parents: 83070cd Author: Eric Liang Authored: Thu Jun 9 18:05:16 2016 -0700 Committer: Josh Rosen Committed: Thu Jun 9 18:05:16 2016 -0700 -- .../scala/org/apache/spark/util/Utils.scala | 47 +++ .../org/apache/spark/util/UtilsSuite.scala | 8 .../sql/catalyst/expressions/Expression.scala | 4 +- .../spark/sql/catalyst/trees/TreeNode.scala | 6 +-- .../org/apache/spark/sql/types/StructType.scala | 7 +-- .../spark/sql/execution/ExistingRDD.scala | 10 ++-- .../spark/sql/execution/QueryExecution.scala| 5 +- .../execution/aggregate/HashAggregateExec.scala | 7 +-- .../execution/aggregate/SortAggregateExec.scala | 7 +-- .../execution/datasources/LogicalRelation.scala | 3 +- .../org/apache/spark/sql/execution/limit.scala | 5 +- .../execution/streaming/StreamExecution.scala | 3 +- .../spark/sql/execution/streaming/memory.scala | 3 +- .../benchmark/WideSchemaBenchmark.scala | 49 14 files changed, 140 insertions(+), 24 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b914e193/core/src/main/scala/org/apache/spark/util/Utils.scala -- diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala b/core/src/main/scala/org/apache/spark/util/Utils.scala index 1a9dbca..f9d0540 100644 --- a/core/src/main/scala/org/apache/spark/util/Utils.scala +++ b/core/src/main/scala/org/apache/spark/util/Utils.scala @@ -26,6 +26,7 @@ import java.nio.charset.StandardCharsets import java.nio.file.Files import java.util.{Locale, Properties, Random, UUID} import java.util.concurrent._ +import java.util.concurrent.atomic.AtomicBoolean import javax.net.ssl.HttpsURLConnection import 
scala.annotation.tailrec @@ -78,6 +79,52 @@ private[spark] object Utils extends Logging { private val MAX_DIR_CREATION_ATTEMPTS: Int = 10 @volatile private var localRootDirs: Array[String] = null + /** + * The performance overhead of creating and logging strings for wide schemas can be large. To + * limit the impact, we bound the number of fields to include by default. This can be overriden + * by setting the 'spark.debug.maxToStringFields' conf in SparkEnv. + */ + val DEFAULT_MAX_TO_STRING_FIELDS = 25 + + private def maxNumToStringFields = {
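The new utility bounds how many fields are rendered when a plan or schema is turned into a string. A hedged, simplified sketch of such a truncating join (the real `Utils.truncatedString` also reads the `spark.debug.maxToStringFields` setting and logs a one-time warning when it drops fields):

```scala
// Simplified sketch; the actual helper lives in org.apache.spark.util.Utils and
// differs in details.
def truncatedString[T](
    seq: Seq[T],
    start: String,
    sep: String,
    end: String,
    maxFields: Int = 25): String = {
  if (seq.length > maxFields) {
    val numDropped = seq.length - maxFields
    seq.take(maxFields).mkString(start, sep, sep + s"... $numDropped more fields" + end)
  } else {
    seq.mkString(start, sep, end)
  }
}

// truncatedString(1 to 100, "[", ", ", "]", maxFields = 3)
//   => "[1, 2, 3, ... 97 more fields]"
```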
spark git commit: [SPARK-15794] Should truncate toString() of very wide plans
Repository: spark Updated Branches: refs/heads/branch-2.0 3119d8eef -> 00bbf7873 [SPARK-15794] Should truncate toString() of very wide plans ## What changes were proposed in this pull request? With very wide tables, e.g. thousands of fields, the plan output is unreadable and often causes OOMs due to inefficient string processing. This truncates all struct and operator field lists to a user configurable threshold to limit performance impact. It would also be nice to optimize string generation to avoid these sort of O(n^2) slowdowns entirely (i.e. use StringBuilder everywhere including expressions), but this is probably too large of a change for 2.0 at this point, and truncation has other benefits for usability. ## How was this patch tested? Added a microbenchmark that covers this case particularly well. I also ran the microbenchmark while varying the truncation threshold. ``` numFields = 5 wide shallowly nested struct field r/w: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative 2000 wide x 50 rows (write in-mem)2336 / 2558 0.0 23364.4 0.1X numFields = 25 wide shallowly nested struct field r/w: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative 2000 wide x 50 rows (write in-mem)4237 / 4465 0.0 42367.9 0.1X numFields = 100 wide shallowly nested struct field r/w: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative 2000 wide x 50 rows (write in-mem) 10458 / 11223 0.0 104582.0 0.0X numFields = Infinity wide shallowly nested struct field r/w: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative [info] java.lang.OutOfMemoryError: Java heap space ``` Author: Eric Liang Author: Eric Liang Closes #13537 from ericl/truncated-string. (cherry picked from commit b914e1930fd5c5f2808f92d4958ec6fbeddf2e30) Signed-off-by: Josh Rosen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/00bbf787 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/00bbf787 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/00bbf787 Branch: refs/heads/branch-2.0 Commit: 00bbf787340e208cc76230ffd96026c1305f942c Parents: 3119d8e Author: Eric Liang Authored: Thu Jun 9 18:05:16 2016 -0700 Committer: Josh Rosen Committed: Thu Jun 9 18:05:30 2016 -0700 -- .../scala/org/apache/spark/util/Utils.scala | 47 +++ .../org/apache/spark/util/UtilsSuite.scala | 8 .../sql/catalyst/expressions/Expression.scala | 4 +- .../spark/sql/catalyst/trees/TreeNode.scala | 6 +-- .../org/apache/spark/sql/types/StructType.scala | 7 +-- .../spark/sql/execution/ExistingRDD.scala | 10 ++-- .../spark/sql/execution/QueryExecution.scala| 5 +- .../execution/aggregate/HashAggregateExec.scala | 7 +-- .../execution/aggregate/SortAggregateExec.scala | 7 +-- .../execution/datasources/LogicalRelation.scala | 3 +- .../org/apache/spark/sql/execution/limit.scala | 5 +- .../execution/streaming/StreamExecution.scala | 3 +- .../spark/sql/execution/streaming/memory.scala | 3 +- .../benchmark/WideSchemaBenchmark.scala | 49 14 files changed, 140 insertions(+), 24 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/00bbf787/core/src/main/scala/org/apache/spark/util/Utils.scala -- diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala b/core/src/main/scala/org/apache/spark/util/Utils.scala index 1a9dbca..f9d0540 100644 --- a/core/src/main/scala/org/apache/spark/util/Utils.scala +++ b/core/src/main/scala/org/apache/spark/util/Utils.scala @@ -26,6 +26,7 @@ import java.nio.charset.StandardCharsets import java.nio.file.Files import java.util.{Locale, Properties, Random, UUID} import 
java.util.concurrent._ +import java.util.concurrent.atomic.AtomicBoolean import javax.net.ssl.HttpsURLConnection import scala.annotation.tailrec @@ -78,6 +79,52 @@ private[spark] object Utils extends Logging { private val MAX_DIR_CREATION_ATTEMPTS: Int = 10 @volatile private var localRootDirs: Array[String] = null + /** + * The performance overhead of creating and logging strings for wide schemas can be large. To + * limit the impact, we bound the number of fields to include by default. This can be overriden + * by setting the 'spark.debug.maxToStringFields' co
spark git commit: [SPARK-15841][Tests] REPLSuite has incorrect env set for a couple of tests.
Repository: spark Updated Branches: refs/heads/branch-2.0 b2d076c35 -> 3119d8eef [SPARK-15841][Tests] REPLSuite has incorrect env set for a couple of tests. Description from JIRA. In ReplSuite, for a test that can be tested well on just local should not really have to start a local-cluster. And similarly a test is in-sufficiently run if it's actually fixing a problem related to a distributed run in environment with local run. Existing tests. Author: Prashant Sharma Closes #13574 from ScrapCodes/SPARK-15841/repl-suite-fix. (cherry picked from commit 83070cd1d459101e1189f3b07ea59e22f98e84ce) Signed-off-by: Shixiong Zhu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3119d8ee Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3119d8ee Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3119d8ee Branch: refs/heads/branch-2.0 Commit: 3119d8eef6dc2d0805c87989746cd79882b9cfea Parents: b2d076c Author: Prashant Sharma Authored: Thu Jun 9 17:45:37 2016 -0700 Committer: Shixiong Zhu Committed: Thu Jun 9 17:46:28 2016 -0700 -- .../src/test/scala/org/apache/spark/repl/ReplSuite.scala | 4 ++-- .../src/test/scala/org/apache/spark/repl/ReplSuite.scala | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3119d8ee/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala -- diff --git a/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala b/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala index 19f201f..26b8600 100644 --- a/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala +++ b/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala @@ -233,7 +233,7 @@ class ReplSuite extends SparkFunSuite { } test("SPARK-1199 two instances of same class don't type check.") { -val output = runInterpreter("local-cluster[1,1,1024]", +val output = runInterpreter("local", """ |case class Sum(exp: String, exp2: String) |val a = Sum("A", "B") @@ -305,7 +305,7 @@ class ReplSuite extends SparkFunSuite { } test("SPARK-2632 importing a method from non serializable class and not using it.") { -val output = runInterpreter("local", +val output = runInterpreter("local-cluster[1,1,1024]", """ |class TestClass() { def testMethod = 3 } |val t = new TestClass http://git-wip-us.apache.org/repos/asf/spark/blob/3119d8ee/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala -- diff --git a/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala b/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala index 48582c1..2444e93 100644 --- a/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala +++ b/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala @@ -276,7 +276,7 @@ class ReplSuite extends SparkFunSuite { } test("SPARK-1199 two instances of same class don't type check.") { -val output = runInterpreter("local-cluster[1,1,1024]", +val output = runInterpreter("local", """ |case class Sum(exp: String, exp2: String) |val a = Sum("A", "B") @@ -336,7 +336,7 @@ class ReplSuite extends SparkFunSuite { } test("SPARK-2632 importing a method from non serializable class and not using it.") { -val output = runInterpreter("local", +val output = runInterpreter("local-cluster[1,1,1024]", """ |class TestClass() { def testMethod = 3 } |val t = new TestClass - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: 
commits-h...@spark.apache.org
spark git commit: [SPARK-15841][Tests] REPLSuite has incorrect env set for a couple of tests.
Repository: spark Updated Branches: refs/heads/master aa0364510 -> 83070cd1d [SPARK-15841][Tests] REPLSuite has incorrect env set for a couple of tests. Description from JIRA. In ReplSuite, for a test that can be tested well on just local should not really have to start a local-cluster. And similarly a test is in-sufficiently run if it's actually fixing a problem related to a distributed run in environment with local run. Existing tests. Author: Prashant Sharma Closes #13574 from ScrapCodes/SPARK-15841/repl-suite-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/83070cd1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/83070cd1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/83070cd1 Branch: refs/heads/master Commit: 83070cd1d459101e1189f3b07ea59e22f98e84ce Parents: aa03645 Author: Prashant Sharma Authored: Thu Jun 9 17:45:37 2016 -0700 Committer: Shixiong Zhu Committed: Thu Jun 9 17:45:42 2016 -0700 -- .../src/test/scala/org/apache/spark/repl/ReplSuite.scala | 4 ++-- .../src/test/scala/org/apache/spark/repl/ReplSuite.scala | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/83070cd1/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala -- diff --git a/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala b/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala index 19f201f..26b8600 100644 --- a/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala +++ b/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala @@ -233,7 +233,7 @@ class ReplSuite extends SparkFunSuite { } test("SPARK-1199 two instances of same class don't type check.") { -val output = runInterpreter("local-cluster[1,1,1024]", +val output = runInterpreter("local", """ |case class Sum(exp: String, exp2: String) |val a = Sum("A", "B") @@ -305,7 +305,7 @@ class ReplSuite extends SparkFunSuite { } test("SPARK-2632 importing a method from non serializable class and not using it.") { -val output = runInterpreter("local", +val output = runInterpreter("local-cluster[1,1,1024]", """ |class TestClass() { def testMethod = 3 } |val t = new TestClass http://git-wip-us.apache.org/repos/asf/spark/blob/83070cd1/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala -- diff --git a/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala b/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala index 48582c1..2444e93 100644 --- a/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala +++ b/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala @@ -276,7 +276,7 @@ class ReplSuite extends SparkFunSuite { } test("SPARK-1199 two instances of same class don't type check.") { -val output = runInterpreter("local-cluster[1,1,1024]", +val output = runInterpreter("local", """ |case class Sum(exp: String, exp2: String) |val a = Sum("A", "B") @@ -336,7 +336,7 @@ class ReplSuite extends SparkFunSuite { } test("SPARK-2632 importing a method from non serializable class and not using it.") { -val output = runInterpreter("local", +val output = runInterpreter("local-cluster[1,1,1024]", """ |class TestClass() { def testMethod = 3 } |val t = new TestClass - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
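The point of the swap is that each regression test now runs in the environment that can actually reproduce its bug: the type-checking test only needs a single in-process JVM (`local`), while the closure-serialization test needs tasks shipped to a separate executor JVM, which `local-cluster[1,1,1024]` (1 worker, 1 core, 1024 MB) provides. A hedged illustration of the two master URLs outside the REPL suite (the builder calls are illustrative; `local-cluster` mode also requires a built Spark distribution):

```scala
import org.apache.spark.sql.SparkSession

// Single JVM: fast, but cannot surface bugs that only appear when closures are
// serialized and sent to a remote executor.
val localSpark =
  SparkSession.builder().master("local").appName("repl-local").getOrCreate()

// One worker with one core and 1024 MB: tasks really cross a process boundary.
// val clusterSpark =
//   SparkSession.builder().master("local-cluster[1,1,1024]").appName("repl-cluster").getOrCreate()
```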
spark git commit: [SPARK-12447][YARN] Only update the states when executor is successfully launched
Repository: spark Updated Branches: refs/heads/master b0768538e -> aa0364510 [SPARK-12447][YARN] Only update the states when executor is successfully launched The details is described in https://issues.apache.org/jira/browse/SPARK-12447. vanzin Please help to review, thanks a lot. Author: jerryshao Closes #10412 from jerryshao/SPARK-12447. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aa036451 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aa036451 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aa036451 Branch: refs/heads/master Commit: aa0364510792c18a0973b6096cd38f611fc1c1a6 Parents: b076853 Author: jerryshao Authored: Thu Jun 9 17:31:19 2016 -0700 Committer: Marcelo Vanzin Committed: Thu Jun 9 17:31:19 2016 -0700 -- .../spark/deploy/yarn/ExecutorRunnable.scala| 5 +- .../spark/deploy/yarn/YarnAllocator.scala | 72 2 files changed, 47 insertions(+), 30 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/aa036451/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala index fc753b7..3d0e996 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala @@ -55,15 +55,14 @@ private[yarn] class ExecutorRunnable( executorCores: Int, appId: String, securityMgr: SecurityManager, -localResources: Map[String, LocalResource]) - extends Runnable with Logging { +localResources: Map[String, LocalResource]) extends Logging { var rpc: YarnRPC = YarnRPC.create(conf) var nmClient: NMClient = _ val yarnConf: YarnConfiguration = new YarnConfiguration(conf) lazy val env = prepareEnvironment(container) - override def run(): Unit = { + def run(): Unit = { logInfo("Starting Executor Container") nmClient = NMClient.createNMClient() nmClient.init(yarnConf) http://git-wip-us.apache.org/repos/asf/spark/blob/aa036451/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala index 066c665..b110d82 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala @@ -24,6 +24,7 @@ import java.util.regex.Pattern import scala.collection.mutable import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet, Queue} import scala.collection.JavaConverters._ +import scala.util.control.NonFatal import org.apache.hadoop.conf.Configuration import org.apache.hadoop.yarn.api.records._ @@ -472,41 +473,58 @@ private[yarn] class YarnAllocator( */ private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = { for (container <- containersToUse) { - numExecutorsRunning += 1 - assert(numExecutorsRunning <= targetNumExecutors) + executorIdCounter += 1 val executorHostname = container.getNodeId.getHost val containerId = container.getId - executorIdCounter += 1 val executorId = executorIdCounter.toString - assert(container.getResource.getMemory >= resource.getMemory) - logInfo("Launching container %s for on host %s".format(containerId, executorHostname)) - executorIdToContainer(executorId) = container - containerIdToExecutorId(container.getId) = executorId - - val containerSet = 
allocatedHostToContainersMap.getOrElseUpdate(executorHostname, -new HashSet[ContainerId]) - - containerSet += containerId - allocatedContainerToHostMap.put(containerId, executorHostname) - - val executorRunnable = new ExecutorRunnable( -container, -conf, -sparkConf, -driverUrl, -executorId, -executorHostname, -executorMemory, -executorCores, -appAttemptId.getApplicationId.toString, -securityMgr, -localResources) + + def updateInternalState(): Unit = synchronized { +numExecutorsRunning += 1 +assert(numExecutorsRunning <= targetNumExecutors) +executorIdToContainer(executorId) = container +containerIdToExecutorId(container.getId) = executorId + +val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname, + new HashSet[ContainerId]) +containerSet += containerId +al
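The behavioral change in the allocator is easy to lose in the diff: the running-executor count and the container-to-executor maps are now updated only after the NodeManager launch succeeds, and a failed launch releases the container back to YARN instead of leaving a phantom "running" executor. A heavily simplified, self-contained sketch of that launch-then-update control flow (all names and types here are stand-ins, not the real YARN or Spark classes):

```scala
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicInteger
import scala.util.control.NonFatal

// Stand-in sketch of the SPARK-12447 idea; not the real YarnAllocator.
object LaunchThenCount {
  private val launcherPool = Executors.newFixedThreadPool(4)
  private val numExecutorsRunning = new AtomicInteger(0)

  def launch(containerId: String,
             startExecutor: () => Unit,
             releaseContainer: String => Unit): Unit = {
    launcherPool.execute(new Runnable {
      override def run(): Unit = {
        try {
          startExecutor()                       // may fail, e.g. an NM launch error
          numExecutorsRunning.incrementAndGet() // count it only once it is really up
        } catch {
          case NonFatal(e) =>
            releaseContainer(containerId)       // hand the container back to YARN
        }
      }
    })
  }
}
```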
spark git commit: [SPARK-12447][YARN] Only update the states when executor is successfully launched
Repository: spark Updated Branches: refs/heads/branch-2.0 b42e3d886 -> b2d076c35 [SPARK-12447][YARN] Only update the states when executor is successfully launched The details is described in https://issues.apache.org/jira/browse/SPARK-12447. vanzin Please help to review, thanks a lot. Author: jerryshao Closes #10412 from jerryshao/SPARK-12447. (cherry picked from commit aa0364510792c18a0973b6096cd38f611fc1c1a6) Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b2d076c3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b2d076c3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b2d076c3 Branch: refs/heads/branch-2.0 Commit: b2d076c35d801219a25b420dd373383c08859a82 Parents: b42e3d8 Author: jerryshao Authored: Thu Jun 9 17:31:19 2016 -0700 Committer: Marcelo Vanzin Committed: Thu Jun 9 17:31:41 2016 -0700 -- .../spark/deploy/yarn/ExecutorRunnable.scala| 5 +- .../spark/deploy/yarn/YarnAllocator.scala | 72 2 files changed, 47 insertions(+), 30 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b2d076c3/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala index fc753b7..3d0e996 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala @@ -55,15 +55,14 @@ private[yarn] class ExecutorRunnable( executorCores: Int, appId: String, securityMgr: SecurityManager, -localResources: Map[String, LocalResource]) - extends Runnable with Logging { +localResources: Map[String, LocalResource]) extends Logging { var rpc: YarnRPC = YarnRPC.create(conf) var nmClient: NMClient = _ val yarnConf: YarnConfiguration = new YarnConfiguration(conf) lazy val env = prepareEnvironment(container) - override def run(): Unit = { + def run(): Unit = { logInfo("Starting Executor Container") nmClient = NMClient.createNMClient() nmClient.init(yarnConf) http://git-wip-us.apache.org/repos/asf/spark/blob/b2d076c3/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala index 066c665..b110d82 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala @@ -24,6 +24,7 @@ import java.util.regex.Pattern import scala.collection.mutable import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet, Queue} import scala.collection.JavaConverters._ +import scala.util.control.NonFatal import org.apache.hadoop.conf.Configuration import org.apache.hadoop.yarn.api.records._ @@ -472,41 +473,58 @@ private[yarn] class YarnAllocator( */ private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = { for (container <- containersToUse) { - numExecutorsRunning += 1 - assert(numExecutorsRunning <= targetNumExecutors) + executorIdCounter += 1 val executorHostname = container.getNodeId.getHost val containerId = container.getId - executorIdCounter += 1 val executorId = executorIdCounter.toString - assert(container.getResource.getMemory >= resource.getMemory) - logInfo("Launching container %s for on host %s".format(containerId, executorHostname)) - 
executorIdToContainer(executorId) = container - containerIdToExecutorId(container.getId) = executorId - - val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname, -new HashSet[ContainerId]) - - containerSet += containerId - allocatedContainerToHostMap.put(containerId, executorHostname) - - val executorRunnable = new ExecutorRunnable( -container, -conf, -sparkConf, -driverUrl, -executorId, -executorHostname, -executorMemory, -executorCores, -appAttemptId.getApplicationId.toString, -securityMgr, -localResources) + + def updateInternalState(): Unit = synchronized { +numExecutorsRunning += 1 +assert(numExecutorsRunning <= targetNumExecutors) +executorIdToContainer(executorId) = container +containerIdToExecutorId(container.getId) = executorId + +val containerSet = allocatedHostToContainersMap.getOrElseU
spark git commit: [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions
Repository: spark Updated Branches: refs/heads/master 6cb71f473 -> b0768538e [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions ## What changes were proposed in this pull request? The current implementations of `UnixTime` and `FromUnixTime` do not cache their parser/formatter as much as they could. This PR resolved this issue. This PR is a take over from https://github.com/apache/spark/pull/13522 and further optimizes the re-use of the parser/formatter. It also fixes the improves handling (catching the actual exception instead of `Throwable`). All credits for this work should go to rajeshbalamohan. This PR closes https://github.com/apache/spark/pull/13522 ## How was this patch tested? Current tests. Author: Herman van Hovell Author: Rajesh Balamohan Closes #13581 from hvanhovell/SPARK-14321. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b0768538 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b0768538 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b0768538 Branch: refs/heads/master Commit: b0768538e56e5bbda7aaabbe2a0197e30ba5f993 Parents: 6cb71f4 Author: Herman van Hovell Authored: Thu Jun 9 16:37:18 2016 -0700 Committer: Reynold Xin Committed: Thu Jun 9 16:37:18 2016 -0700 -- .../expressions/datetimeExpressions.scala | 48 ++-- 1 file changed, 24 insertions(+), 24 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b0768538/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala index 69c32f4..773431d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala @@ -399,6 +399,8 @@ abstract class UnixTime extends BinaryExpression with ExpectsInputTypes { override def nullable: Boolean = true private lazy val constFormat: UTF8String = right.eval().asInstanceOf[UTF8String] + private lazy val formatter: SimpleDateFormat = +Try(new SimpleDateFormat(constFormat.toString)).getOrElse(null) override def eval(input: InternalRow): Any = { val t = left.eval(input) @@ -411,11 +413,11 @@ abstract class UnixTime extends BinaryExpression with ExpectsInputTypes { case TimestampType => t.asInstanceOf[Long] / 100L case StringType if right.foldable => - if (constFormat != null) { -Try(new SimpleDateFormat(constFormat.toString).parse( - t.asInstanceOf[UTF8String].toString).getTime / 1000L).getOrElse(null) - } else { + if (constFormat == null || formatter == null) { null + } else { +Try(formatter.parse( + t.asInstanceOf[UTF8String].toString).getTime / 1000L).getOrElse(null) } case StringType => val f = right.eval(input) @@ -434,13 +436,10 @@ abstract class UnixTime extends BinaryExpression with ExpectsInputTypes { left.dataType match { case StringType if right.foldable => val sdf = classOf[SimpleDateFormat].getName -val fString = if (constFormat == null) null else constFormat.toString -val formatter = ctx.freshName("formatter") -if (fString == null) { - ev.copy(code = s""" -boolean ${ev.isNull} = true; -${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)};""") +if (formatter == null) { + ExprCode("", "true", ctx.defaultValue(dataType)) } else { + val formatterName = 
ctx.addReferenceObj("formatter", formatter, sdf) val eval1 = left.genCode(ctx) ev.copy(code = s""" ${eval1.code} @@ -448,10 +447,8 @@ abstract class UnixTime extends BinaryExpression with ExpectsInputTypes { ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)}; if (!${ev.isNull}) { try { -$sdf $formatter = new $sdf("$fString"); -${ev.value} = - $formatter.parse(${eval1.value}.toString()).getTime() / 1000L; - } catch (java.lang.Throwable e) { +${ev.value} = $formatterName.parse(${eval1.value}.toString()).getTime() / 1000L; + } catch (java.text.ParseException e) { ${ev.isNull} = true; } }""") @@ -463,7 +460,9 @@ abstract class UnixTime extend
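The core of the change is easy to see in isolation: build the `SimpleDateFormat` once per constant format string instead of once per row, and catch only the parse failure you expect. The sketch below is a hypothetical stand-alone class illustrating that pattern — it is not the actual Spark `UnixTime` expression, which additionally handles codegen and non-constant formats.

```scala
import java.text.{ParseException, SimpleDateFormat}
import scala.util.Try

// Hypothetical illustration of the caching pattern applied by the patch.
class ConstFormatUnixTime(constFormat: String) {
  // Built at most once; null if the pattern itself is invalid, mirroring Try(...).getOrElse(null).
  private lazy val formatter: SimpleDateFormat =
    Try(new SimpleDateFormat(constFormat)).getOrElse(null)

  // Seconds since the epoch, or null when the format or the value cannot be parsed.
  def eval(timeString: String): Any = {
    if (formatter == null || timeString == null) {
      null
    } else {
      try {
        formatter.parse(timeString).getTime / 1000L
      } catch {
        case _: ParseException => null
      }
    }
  }
}

// Usage: the formatter is constructed once, no matter how many values are evaluated.
// new ConstFormatUnixTime("yyyy-MM-dd").eval("2016-06-09")
```

Returning null for both an invalid pattern and an unparsable value mirrors the expression's null-on-failure semantics, while narrowing the catch to `ParseException` avoids masking unrelated errors that the old `Throwable` catch could swallow.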
spark git commit: [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions
Repository: spark Updated Branches: refs/heads/branch-2.0 0408793aa -> b42e3d886 [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions ## What changes were proposed in this pull request? The current implementations of `UnixTime` and `FromUnixTime` do not cache their parser/formatter as much as they could. This PR resolves this issue. This PR is a take-over from https://github.com/apache/spark/pull/13522 and further optimizes the re-use of the parser/formatter. It also improves error handling (catching the actual exception instead of `Throwable`). All credits for this work should go to rajeshbalamohan. This PR closes https://github.com/apache/spark/pull/13522 ## How was this patch tested? Current tests. Author: Herman van Hovell Author: Rajesh Balamohan Closes #13581 from hvanhovell/SPARK-14321. (cherry picked from commit b0768538e56e5bbda7aaabbe2a0197e30ba5f993) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b42e3d88 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b42e3d88 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b42e3d88 Branch: refs/heads/branch-2.0 Commit: b42e3d886f1688feb6e9ca85a60ce5e6295e8489 Parents: 0408793 Author: Herman van Hovell Authored: Thu Jun 9 16:37:18 2016 -0700 Committer: Reynold Xin Committed: Thu Jun 9 16:37:29 2016 -0700 -- .../expressions/datetimeExpressions.scala | 48 ++-- 1 file changed, 24 insertions(+), 24 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b42e3d88/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala index 69c32f4..773431d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala @@ -399,6 +399,8 @@ abstract class UnixTime extends BinaryExpression with ExpectsInputTypes { override def nullable: Boolean = true private lazy val constFormat: UTF8String = right.eval().asInstanceOf[UTF8String] + private lazy val formatter: SimpleDateFormat = +Try(new SimpleDateFormat(constFormat.toString)).getOrElse(null) override def eval(input: InternalRow): Any = { val t = left.eval(input) @@ -411,11 +413,11 @@ abstract class UnixTime extends BinaryExpression with ExpectsInputTypes { case TimestampType => t.asInstanceOf[Long] / 100L case StringType if right.foldable => - if (constFormat != null) { -Try(new SimpleDateFormat(constFormat.toString).parse( - t.asInstanceOf[UTF8String].toString).getTime / 1000L).getOrElse(null) - } else { + if (constFormat == null || formatter == null) { null + } else { +Try(formatter.parse( + t.asInstanceOf[UTF8String].toString).getTime / 1000L).getOrElse(null) } case StringType => val f = right.eval(input) @@ -434,13 +436,10 @@ abstract class UnixTime extends BinaryExpression with ExpectsInputTypes { left.dataType match { case StringType if right.foldable => val sdf = classOf[SimpleDateFormat].getName -val fString = if (constFormat == null) null else constFormat.toString -val formatter = ctx.freshName("formatter") -if (fString == null) { - ev.copy(code = s""" -boolean ${ev.isNull} = true; -${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)};""") +if 
(formatter == null) { + ExprCode("", "true", ctx.defaultValue(dataType)) } else { + val formatterName = ctx.addReferenceObj("formatter", formatter, sdf) val eval1 = left.genCode(ctx) ev.copy(code = s""" ${eval1.code} @@ -448,10 +447,8 @@ abstract class UnixTime extends BinaryExpression with ExpectsInputTypes { ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)}; if (!${ev.isNull}) { try { -$sdf $formatter = new $sdf("$fString"); -${ev.value} = - $formatter.parse(${eval1.value}.toString()).getTime() / 1000L; - } catch (java.lang.Throwable e) { +${ev.value} = $formatterName.parse(${eval1.value}.toString()).getTime() / 1000L; + } catch (java.text.ParseException e) { $
spark git commit: [SPARK-15839] Fix Maven doc-jar generation when JAVA_7_HOME is set
Repository: spark Updated Branches: refs/heads/branch-2.0 07a914c09 -> 0408793aa [SPARK-15839] Fix Maven doc-jar generation when JAVA_7_HOME is set ## What changes were proposed in this pull request? It looks like the nightly Maven snapshots broke after we set `JAVA_7_HOME` in the build: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/1573/. It seems that passing `-javabootclasspath` to ScalaDoc using scala-maven-plugin ends up preventing the Scala library classes from being added to scalac's internal class path, causing compilation errors while building doc-jars. There might be a principled fix to this inside of the scala-maven-plugin itself, but for now this patch configures the build to omit the `-javabootclasspath` option during Maven doc-jar generation. ## How was this patch tested? Tested manually with `build/mvn clean install -DskipTests=true` when `JAVA_7_HOME` was set. Also manually inspected the effective POM diff to verify that the final POM changes were scoped correctly: https://gist.github.com/JoshRosen/f889d1c236fad14fa25ac4be01654653 /cc vanzin and yhuai for review. Author: Josh Rosen Closes #13573 from JoshRosen/SPARK-15839. (cherry picked from commit 6cb71f4733a920d916b91c66bb2a508a21883b16) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0408793a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0408793a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0408793a Branch: refs/heads/branch-2.0 Commit: 0408793aaab177920363913c9e91ac3af695370a Parents: 07a914c Author: Josh Rosen Authored: Thu Jun 9 12:32:29 2016 -0700 Committer: Yin Huai Committed: Thu Jun 9 12:33:00 2016 -0700 -- pom.xml | 29 +++-- 1 file changed, 23 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0408793a/pom.xml -- diff --git a/pom.xml b/pom.xml index 0419896..7f9ea44 100644 --- a/pom.xml +++ b/pom.xml @@ -2605,12 +2605,29 @@ net.alchim31.maven scala-maven-plugin - - - -javabootclasspath - ${env.JAVA_7_HOME}/jre/lib/rt.jar - - + + + + scala-compile-first + + + -javabootclasspath + ${env.JAVA_7_HOME}/jre/lib/rt.jar + + + + + scala-test-compile-first + + + -javabootclasspath + ${env.JAVA_7_HOME}/jre/lib/rt.jar + + + + - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15839] Fix Maven doc-jar generation when JAVA_7_HOME is set
Repository: spark Updated Branches: refs/heads/master f74b77713 -> 6cb71f473 [SPARK-15839] Fix Maven doc-jar generation when JAVA_7_HOME is set ## What changes were proposed in this pull request? It looks like the nightly Maven snapshots broke after we set `JAVA_7_HOME` in the build: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/1573/. It seems that passing `-javabootclasspath` to ScalaDoc using scala-maven-plugin ends up preventing the Scala library classes from being added to scalac's internal class path, causing compilation errors while building doc-jars. There might be a principled fix to this inside of the scala-maven-plugin itself, but for now this patch configures the build to omit the `-javabootclasspath` option during Maven doc-jar generation. ## How was this patch tested? Tested manually with `build/mvn clean install -DskipTests=true` when `JAVA_7_HOME` was set. Also manually inspected the effective POM diff to verify that the final POM changes were scoped correctly: https://gist.github.com/JoshRosen/f889d1c236fad14fa25ac4be01654653 /cc vanzin and yhuai for review. Author: Josh Rosen Closes #13573 from JoshRosen/SPARK-15839. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6cb71f47 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6cb71f47 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6cb71f47 Branch: refs/heads/master Commit: 6cb71f4733a920d916b91c66bb2a508a21883b16 Parents: f74b777 Author: Josh Rosen Authored: Thu Jun 9 12:32:29 2016 -0700 Committer: Yin Huai Committed: Thu Jun 9 12:32:29 2016 -0700 -- pom.xml | 29 +++-- 1 file changed, 23 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6cb71f47/pom.xml -- diff --git a/pom.xml b/pom.xml index fd40683..89ed87f 100644 --- a/pom.xml +++ b/pom.xml @@ -2605,12 +2605,29 @@ net.alchim31.maven scala-maven-plugin - - - -javabootclasspath - ${env.JAVA_7_HOME}/jre/lib/rt.jar - - + + + + scala-compile-first + + + -javabootclasspath + ${env.JAVA_7_HOME}/jre/lib/rt.jar + + + + + scala-test-compile-first + + + -javabootclasspath + ${env.JAVA_7_HOME}/jre/lib/rt.jar + + + + - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central
Repository: spark Updated Branches: refs/heads/branch-1.6 bb917fc65 -> 739d992f0 [SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central Spark's SBT build currently uses a fork of the sbt-pom-reader plugin but depends on that fork via a SBT subproject which is cloned from https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This unnecessarily slows down the initial build on fresh machines and is also risky because it risks a build breakage in case that GitHub repository ever changes or is deleted. In order to address these issues, I have published a pre-built binary of our forked sbt-pom-reader plugin to Maven Central under the `org.spark-project` namespace and have updated Spark's build to use that artifact. This published artifact was built from https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark, which contains the contents of ScrapCodes's branch plus an additional patch to configure the build for artifact publication. /cc srowen ScrapCodes for review. Author: Josh Rosen Closes #13564 from JoshRosen/use-published-fork-of-pom-reader. (cherry picked from commit f74b77713e17960dddb7459eabfdc19f08f4024b) Signed-off-by: Josh Rosen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/739d992f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/739d992f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/739d992f Branch: refs/heads/branch-1.6 Commit: 739d992f041b995fbf44b93cf47bced3d3811ad9 Parents: bb917fc Author: Josh Rosen Authored: Thu Jun 9 11:04:08 2016 -0700 Committer: Josh Rosen Committed: Thu Jun 9 11:08:22 2016 -0700 -- project/plugins.sbt| 9 + project/project/SparkPluginBuild.scala | 28 2 files changed, 9 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/739d992f/project/plugins.sbt -- diff --git a/project/plugins.sbt b/project/plugins.sbt index c06687d..6a4e401 100644 --- a/project/plugins.sbt +++ b/project/plugins.sbt @@ -32,3 +32,12 @@ addSbtPlugin("io.spray" % "sbt-revolver" % "0.7.2") libraryDependencies += "org.ow2.asm" % "asm" % "5.0.3" libraryDependencies += "org.ow2.asm" % "asm-commons" % "5.0.3" + +// Spark uses a custom fork of the sbt-pom-reader plugin which contains a patch to fix issues +// related to test-jar dependencies (https://github.com/sbt/sbt-pom-reader/pull/14). The source for +// this fork is published at https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark +// and corresponds to commit b160317fcb0b9d1009635a7c5aa05d0f3be61936 in that repository. +// In the long run, we should try to merge our patch upstream and switch to an upstream version of +// the plugin; this is tracked at SPARK-14401. + +addSbtPlugin("org.spark-project" % "sbt-pom-reader" % "1.0.0-spark") http://git-wip-us.apache.org/repos/asf/spark/blob/739d992f/project/project/SparkPluginBuild.scala -- diff --git a/project/project/SparkPluginBuild.scala b/project/project/SparkPluginBuild.scala deleted file mode 100644 index cbb88dc..000 --- a/project/project/SparkPluginBuild.scala +++ /dev/null @@ -1,28 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import sbt._ -import sbt.Keys._ - -/** - * This plugin project is there because we use our custom fork of sbt-pom-reader plugin. This is - * a plugin project so that this gets compiled first and is available on the classpath for SBT build. - */ -object SparkPluginDef extends Build { - lazy val root = Project("plugins", file(".")) dependsOn(sbtPomReader) - lazy val sbtPomReader = uri("https://github.com/ScrapCodes/sbt-pom-reader.git#ignore_artifact_id";) -} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central
Repository: spark Updated Branches: refs/heads/branch-2.0 10f759947 -> 07a914c09 [SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central Spark's SBT build currently uses a fork of the sbt-pom-reader plugin but depends on that fork via a SBT subproject which is cloned from https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This unnecessarily slows down the initial build on fresh machines and is also risky because it risks a build breakage in case that GitHub repository ever changes or is deleted. In order to address these issues, I have published a pre-built binary of our forked sbt-pom-reader plugin to Maven Central under the `org.spark-project` namespace and have updated Spark's build to use that artifact. This published artifact was built from https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark, which contains the contents of ScrapCodes's branch plus an additional patch to configure the build for artifact publication. /cc srowen ScrapCodes for review. Author: Josh Rosen Closes #13564 from JoshRosen/use-published-fork-of-pom-reader. (cherry picked from commit f74b77713e17960dddb7459eabfdc19f08f4024b) Signed-off-by: Josh Rosen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07a914c0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07a914c0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07a914c0 Branch: refs/heads/branch-2.0 Commit: 07a914c09767deb67a0088b54ee48929a8567374 Parents: 10f7599 Author: Josh Rosen Authored: Thu Jun 9 11:04:08 2016 -0700 Committer: Josh Rosen Committed: Thu Jun 9 11:04:27 2016 -0700 -- project/plugins.sbt| 9 + project/project/SparkPluginBuild.scala | 28 2 files changed, 9 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/07a914c0/project/plugins.sbt -- diff --git a/project/plugins.sbt b/project/plugins.sbt index 4578b56..8bebd7b 100644 --- a/project/plugins.sbt +++ b/project/plugins.sbt @@ -21,3 +21,12 @@ libraryDependencies += "org.ow2.asm" % "asm" % "5.0.3" libraryDependencies += "org.ow2.asm" % "asm-commons" % "5.0.3" addSbtPlugin("com.simplytyped" % "sbt-antlr4" % "0.7.11") + +// Spark uses a custom fork of the sbt-pom-reader plugin which contains a patch to fix issues +// related to test-jar dependencies (https://github.com/sbt/sbt-pom-reader/pull/14). The source for +// this fork is published at https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark +// and corresponds to commit b160317fcb0b9d1009635a7c5aa05d0f3be61936 in that repository. +// In the long run, we should try to merge our patch upstream and switch to an upstream version of +// the plugin; this is tracked at SPARK-14401. + +addSbtPlugin("org.spark-project" % "sbt-pom-reader" % "1.0.0-spark") http://git-wip-us.apache.org/repos/asf/spark/blob/07a914c0/project/project/SparkPluginBuild.scala -- diff --git a/project/project/SparkPluginBuild.scala b/project/project/SparkPluginBuild.scala deleted file mode 100644 index cbb88dc..000 --- a/project/project/SparkPluginBuild.scala +++ /dev/null @@ -1,28 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import sbt._ -import sbt.Keys._ - -/** - * This plugin project is there because we use our custom fork of sbt-pom-reader plugin. This is - * a plugin project so that this gets compiled first and is available on the classpath for SBT build. - */ -object SparkPluginDef extends Build { - lazy val root = Project("plugins", file(".")) dependsOn(sbtPomReader) - lazy val sbtPomReader = uri("https://github.com/ScrapCodes/sbt-pom-reader.git#ignore_artifact_id";) -} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central
Repository: spark Updated Branches: refs/heads/master e594b4928 -> f74b77713 [SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central Spark's SBT build currently uses a fork of the sbt-pom-reader plugin but depends on that fork via a SBT subproject which is cloned from https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This unnecessarily slows down the initial build on fresh machines and is also risky because it risks a build breakage in case that GitHub repository ever changes or is deleted. In order to address these issues, I have published a pre-built binary of our forked sbt-pom-reader plugin to Maven Central under the `org.spark-project` namespace and have updated Spark's build to use that artifact. This published artifact was built from https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark, which contains the contents of ScrapCodes's branch plus an additional patch to configure the build for artifact publication. /cc srowen ScrapCodes for review. Author: Josh Rosen Closes #13564 from JoshRosen/use-published-fork-of-pom-reader. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f74b7771 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f74b7771 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f74b7771 Branch: refs/heads/master Commit: f74b77713e17960dddb7459eabfdc19f08f4024b Parents: e594b49 Author: Josh Rosen Authored: Thu Jun 9 11:04:08 2016 -0700 Committer: Josh Rosen Committed: Thu Jun 9 11:04:08 2016 -0700 -- project/plugins.sbt| 9 + project/project/SparkPluginBuild.scala | 28 2 files changed, 9 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f74b7771/project/plugins.sbt -- diff --git a/project/plugins.sbt b/project/plugins.sbt index 4578b56..8bebd7b 100644 --- a/project/plugins.sbt +++ b/project/plugins.sbt @@ -21,3 +21,12 @@ libraryDependencies += "org.ow2.asm" % "asm" % "5.0.3" libraryDependencies += "org.ow2.asm" % "asm-commons" % "5.0.3" addSbtPlugin("com.simplytyped" % "sbt-antlr4" % "0.7.11") + +// Spark uses a custom fork of the sbt-pom-reader plugin which contains a patch to fix issues +// related to test-jar dependencies (https://github.com/sbt/sbt-pom-reader/pull/14). The source for +// this fork is published at https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark +// and corresponds to commit b160317fcb0b9d1009635a7c5aa05d0f3be61936 in that repository. +// In the long run, we should try to merge our patch upstream and switch to an upstream version of +// the plugin; this is tracked at SPARK-14401. + +addSbtPlugin("org.spark-project" % "sbt-pom-reader" % "1.0.0-spark") http://git-wip-us.apache.org/repos/asf/spark/blob/f74b7771/project/project/SparkPluginBuild.scala -- diff --git a/project/project/SparkPluginBuild.scala b/project/project/SparkPluginBuild.scala deleted file mode 100644 index cbb88dc..000 --- a/project/project/SparkPluginBuild.scala +++ /dev/null @@ -1,28 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import sbt._ -import sbt.Keys._ - -/** - * This plugin project is there because we use our custom fork of sbt-pom-reader plugin. This is - * a plugin project so that this gets compiled first and is available on the classpath for SBT build. - */ -object SparkPluginDef extends Build { - lazy val root = Project("plugins", file(".")) dependsOn(sbtPomReader) - lazy val sbtPomReader = uri("https://github.com/ScrapCodes/sbt-pom-reader.git#ignore_artifact_id";) -} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property
Repository: spark Updated Branches: refs/heads/branch-2.0 eb9e8fc09 -> 10f759947 [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property ## What changes were proposed in this pull request? add method idf to IDF in pyspark ## How was this patch tested? add unit test Author: Jeff Zhang Closes #13540 from zjffdu/SPARK-15788. (cherry picked from commit e594b492836988ef3d9487b511368c70169d1ecd) Signed-off-by: Nick Pentreath Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/10f75994 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/10f75994 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/10f75994 Branch: refs/heads/branch-2.0 Commit: 10f7599471dd0d2b5efb49c5e1664fa24dfee074 Parents: eb9e8fc Author: Jeff Zhang Authored: Thu Jun 9 09:54:38 2016 -0700 Committer: Nick Pentreath Committed: Thu Jun 9 09:54:47 2016 -0700 -- python/pyspark/ml/feature.py | 10 ++ 1 file changed, 10 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/10f75994/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 1aff2e5..ebe1300 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -585,6 +585,8 @@ class IDF(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritab ... (DenseVector([0.0, 1.0]),), (DenseVector([3.0, 0.2]),)], ["tf"]) >>> idf = IDF(minDocFreq=3, inputCol="tf", outputCol="idf") >>> model = idf.fit(df) +>>> model.idf +DenseVector([0.0, 0.0]) >>> model.transform(df).head().idf DenseVector([0.0, 0.0]) >>> idf.setParams(outputCol="freqs").fit(df).transform(df).collect()[1].freqs @@ -658,6 +660,14 @@ class IDFModel(JavaModel, JavaMLReadable, JavaMLWritable): .. versionadded:: 1.4.0 """ +@property +@since("2.0.0") +def idf(self): +""" +Returns the IDF vector. +""" +return self._call_java("idf") + @inherit_doc class MaxAbsScaler(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property
Repository: spark Updated Branches: refs/heads/master 99386fe39 -> e594b4928 [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property ## What changes were proposed in this pull request? add method idf to IDF in pyspark ## How was this patch tested? add unit test Author: Jeff Zhang Closes #13540 from zjffdu/SPARK-15788. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e594b492 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e594b492 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e594b492 Branch: refs/heads/master Commit: e594b492836988ef3d9487b511368c70169d1ecd Parents: 99386fe Author: Jeff Zhang Authored: Thu Jun 9 09:54:38 2016 -0700 Committer: Nick Pentreath Committed: Thu Jun 9 09:54:38 2016 -0700 -- python/pyspark/ml/feature.py | 10 ++ 1 file changed, 10 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e594b492/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 1aff2e5..ebe1300 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -585,6 +585,8 @@ class IDF(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritab ... (DenseVector([0.0, 1.0]),), (DenseVector([3.0, 0.2]),)], ["tf"]) >>> idf = IDF(minDocFreq=3, inputCol="tf", outputCol="idf") >>> model = idf.fit(df) +>>> model.idf +DenseVector([0.0, 0.0]) >>> model.transform(df).head().idf DenseVector([0.0, 0.0]) >>> idf.setParams(outputCol="freqs").fit(df).transform(df).collect()[1].freqs @@ -658,6 +660,14 @@ class IDFModel(JavaModel, JavaMLReadable, JavaMLWritable): .. versionadded:: 1.4.0 """ +@property +@since("2.0.0") +def idf(self): +""" +Returns the IDF vector. +""" +return self._call_java("idf") + @inherit_doc class MaxAbsScaler(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
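Since the new Python property simply forwards to the JVM model via `_call_java("idf")`, it surfaces the same vector the Scala API exposes as `IDFModel.idf`. For comparison, here is a minimal Scala sketch, assuming a spark-shell session with an active `SparkSession` named `spark`:

```scala
import org.apache.spark.ml.feature.IDF
import org.apache.spark.ml.linalg.Vectors

// Same toy term-frequency vectors as the Python doctest above.
val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 2.0)),
  (1, Vectors.dense(0.0, 1.0)),
  (2, Vectors.dense(3.0, 0.2))
)).toDF("id", "tf")

val model = new IDF()
  .setMinDocFreq(3)
  .setInputCol("tf")
  .setOutputCol("idf")
  .fit(df)

// The fitted IDF vector; the new PySpark `idf` property returns the same values.
println(model.idf)   // [0.0,0.0] for this toy data, matching the doctest
```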
spark git commit: [SPARK-15804][SQL] Include metadata in the toStructType
Repository: spark Updated Branches: refs/heads/master 147c02082 -> 99386fe39 [SPARK-15804][SQL] Include metadata in the toStructType ## What changes were proposed in this pull request? The helper function 'toStructType' in the AttributeSeq class doesn't include the metadata when it builds the StructField, which causes the problem reported in https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK when Spark writes a dataframe with metadata to the Parquet datasource. The code path: when Spark writes the dataframe to the Parquet datasource through InsertIntoHadoopFsRelationCommand, it builds the WriteRelation container and calls the helper function 'toStructType' to create the StructType that contains the StructFields; the metadata should be included there, otherwise the user-provided metadata is lost. ## How was this patch tested? Added a test case in ParquetQuerySuite.scala. Author: Kevin Yu Closes #13555 from kevinyu98/spark-15804. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/99386fe3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/99386fe3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/99386fe3 Branch: refs/heads/master Commit: 99386fe3989f758844de14b2c28eccfdf8163221 Parents: 147c020 Author: Kevin Yu Authored: Thu Jun 9 09:50:09 2016 -0700 Committer: Wenchen Fan Committed: Thu Jun 9 09:50:09 2016 -0700 -- .../spark/sql/catalyst/expressions/package.scala | 2 +- .../datasources/parquet/ParquetQuerySuite.scala | 15 +++ 2 files changed, 16 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/99386fe3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala index 81f5bb4..a6125c6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala @@ -91,7 +91,7 @@ package object expressions { implicit class AttributeSeq(val attrs: Seq[Attribute]) extends Serializable { /** Creates a StructType with a schema matching this `Seq[Attribute]`. 
*/ def toStructType: StructType = { - StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable))) + StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata))) } // It's possible that `attrs` is a linked list, which can lead to bad O(n^2) loops when http://git-wip-us.apache.org/repos/asf/spark/blob/99386fe3/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala index 78b97f6..ea57f71 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala @@ -625,6 +625,21 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext } } } + + test("SPARK-15804: write out the metadata to parquet file") { +val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b") +val md = new MetadataBuilder().putString("key", "value").build() +val dfWithmeta = df.select('a, 'b.as("b", md)) + +withTempPath { dir => + val path = dir.getCanonicalPath + dfWithmeta.write.parquet(path) + + readParquetFile(path) { df => +assert(df.schema.last.metadata.getString("key") == "value") + } +} + } } object TestingUDT { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
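In user-facing terms, the fix means that column-level metadata attached with `Column.as(alias, metadata)` now survives a Parquet write. Below is a minimal spark-shell sketch mirroring the new test case; the output path is just an illustrative placeholder.

```scala
import org.apache.spark.sql.types.MetadataBuilder

// In spark-shell; in a standalone application also add: import spark.implicits._
val md = new MetadataBuilder().putString("key", "value").build()

// Attach metadata to column "b", then write to Parquet.
val dfWithMeta = Seq((1, "abc"), (2, "hello")).toDF("a", "b")
  .select($"a", $"b".as("b", md))
dfWithMeta.write.parquet("/tmp/spark-15804-example")   // placeholder path

// With the fix, the metadata makes the round trip.
val readBack = spark.read.parquet("/tmp/spark-15804-example")
println(readBack.schema("b").metadata.getString("key"))   // value
```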
spark git commit: [SPARK-15804][SQL] Include metadata in the toStructType
Repository: spark Updated Branches: refs/heads/branch-2.0 77c08d224 -> eb9e8fc09 [SPARK-15804][SQL] Include metadata in the toStructType ## What changes were proposed in this pull request? The helper function 'toStructType' in the AttributeSeq class doesn't include the metadata when it builds the StructField, which causes the problem reported in https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK when Spark writes a dataframe with metadata to the Parquet datasource. The code path: when Spark writes the dataframe to the Parquet datasource through InsertIntoHadoopFsRelationCommand, it builds the WriteRelation container and calls the helper function 'toStructType' to create the StructType that contains the StructFields; the metadata should be included there, otherwise the user-provided metadata is lost. ## How was this patch tested? Added a test case in ParquetQuerySuite.scala. Author: Kevin Yu Closes #13555 from kevinyu98/spark-15804. (cherry picked from commit 99386fe3989f758844de14b2c28eccfdf8163221) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eb9e8fc0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eb9e8fc0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eb9e8fc0 Branch: refs/heads/branch-2.0 Commit: eb9e8fc097384dbe0d2cb83ca5b80968e3539c78 Parents: 77c08d2 Author: Kevin Yu Authored: Thu Jun 9 09:50:09 2016 -0700 Committer: Wenchen Fan Committed: Thu Jun 9 09:50:19 2016 -0700 -- .../spark/sql/catalyst/expressions/package.scala | 2 +- .../datasources/parquet/ParquetQuerySuite.scala | 15 +++ 2 files changed, 16 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/eb9e8fc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala index 81f5bb4..a6125c6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala @@ -91,7 +91,7 @@ package object expressions { implicit class AttributeSeq(val attrs: Seq[Attribute]) extends Serializable { /** Creates a StructType with a schema matching this `Seq[Attribute]`. 
*/ def toStructType: StructType = { - StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable))) + StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata))) } // It's possible that `attrs` is a linked list, which can lead to bad O(n^2) loops when http://git-wip-us.apache.org/repos/asf/spark/blob/eb9e8fc0/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala index 78b97f6..ea57f71 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala @@ -625,6 +625,21 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext } } } + + test("SPARK-15804: write out the metadata to parquet file") { +val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b") +val md = new MetadataBuilder().putString("key", "value").build() +val dfWithmeta = df.select('a, 'b.as("b", md)) + +withTempPath { dir => + val path = dir.getCanonicalPath + dfWithmeta.write.parquet(path) + + readParquetFile(path) { df => +assert(df.schema.last.metadata.getString("key") == "value") + } +} + } } object TestingUDT { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15818][BUILD] Upgrade to Hadoop 2.7.2
Repository: spark Updated Branches: refs/heads/branch-2.0 8ee93eed9 -> 77c08d224 [SPARK-15818][BUILD] Upgrade to Hadoop 2.7.2 ## What changes were proposed in this pull request? Updating the Hadoop version from 2.7.0 to 2.7.2 if we use the Hadoop-2.7 build profile ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Existing tests (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating Hadoop 2.7.0 is not ready for production use https://hadoop.apache.org/docs/r2.7.0/ states "Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.6.0. This release is not yet ready for production use. Production users should use 2.7.1 release and beyond." Hadoop 2.7.1 release notes: "Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building upon the previous release 2.7.0. This is the next stable release after Apache Hadoop 2.6.x." And then Hadoop 2.7.2 release notes: "Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building upon the previous stable release 2.7.1." I've tested this is OK with Intel hardware and IBM Java 8 so let's test it with OpenJDK, ideally this will be pushed to branch-2.0 and master. Author: Adam Roberts Closes #13556 from a-roberts/patch-2. (cherry picked from commit 147c020823080c60b495f7950629d8134bf895db) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/77c08d22 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/77c08d22 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/77c08d22 Branch: refs/heads/branch-2.0 Commit: 77c08d2240bef7d814fc6e4dd0a53fbdf1e2f795 Parents: 8ee93ee Author: Adam Roberts Authored: Thu Jun 9 10:34:01 2016 +0100 Committer: Sean Owen Committed: Thu Jun 9 10:34:15 2016 +0100 -- dev/deps/spark-deps-hadoop-2.4 | 30 +++--- dev/deps/spark-deps-hadoop-2.6 | 30 +++--- dev/deps/spark-deps-hadoop-2.7 | 30 +++--- pom.xml| 6 +++--- 4 files changed, 48 insertions(+), 48 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/77c08d22/dev/deps/spark-deps-hadoop-2.4 -- diff --git a/dev/deps/spark-deps-hadoop-2.4 b/dev/deps/spark-deps-hadoop-2.4 index 77d5266..3df292e 100644 --- a/dev/deps/spark-deps-hadoop-2.4 +++ b/dev/deps/spark-deps-hadoop-2.4 @@ -53,21 +53,21 @@ eigenbase-properties-1.1.5.jar guava-14.0.1.jar guice-3.0.jar guice-servlet-3.0.jar -hadoop-annotations-2.4.0.jar -hadoop-auth-2.4.0.jar -hadoop-client-2.4.0.jar -hadoop-common-2.4.0.jar -hadoop-hdfs-2.4.0.jar -hadoop-mapreduce-client-app-2.4.0.jar -hadoop-mapreduce-client-common-2.4.0.jar -hadoop-mapreduce-client-core-2.4.0.jar -hadoop-mapreduce-client-jobclient-2.4.0.jar -hadoop-mapreduce-client-shuffle-2.4.0.jar -hadoop-yarn-api-2.4.0.jar -hadoop-yarn-client-2.4.0.jar -hadoop-yarn-common-2.4.0.jar -hadoop-yarn-server-common-2.4.0.jar -hadoop-yarn-server-web-proxy-2.4.0.jar +hadoop-annotations-2.4.1.jar +hadoop-auth-2.4.1.jar +hadoop-client-2.4.1.jar +hadoop-common-2.4.1.jar +hadoop-hdfs-2.4.1.jar +hadoop-mapreduce-client-app-2.4.1.jar +hadoop-mapreduce-client-common-2.4.1.jar +hadoop-mapreduce-client-core-2.4.1.jar +hadoop-mapreduce-client-jobclient-2.4.1.jar +hadoop-mapreduce-client-shuffle-2.4.1.jar +hadoop-yarn-api-2.4.1.jar +hadoop-yarn-client-2.4.1.jar +hadoop-yarn-common-2.4.1.jar +hadoop-yarn-server-common-2.4.1.jar 
+hadoop-yarn-server-web-proxy-2.4.1.jar hk2-api-2.4.0-b34.jar hk2-locator-2.4.0-b34.jar hk2-utils-2.4.0-b34.jar http://git-wip-us.apache.org/repos/asf/spark/blob/77c08d22/dev/deps/spark-deps-hadoop-2.6 -- diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6 index 9afe50f..9540f58 100644 --- a/dev/deps/spark-deps-hadoop-2.6 +++ b/dev/deps/spark-deps-hadoop-2.6 @@ -58,21 +58,21 @@ gson-2.2.4.jar guava-14.0.1.jar guice-3.0.jar guice-servlet-3.0.jar -hadoop-annotations-2.6.0.jar -hadoop-auth-2.6.0.jar -hadoop-client-2.6.0.jar -hadoop-common-2.6.0.jar -hadoop-hdfs-2.6.0.jar -hadoop-mapreduce-client-app-2.6.0.jar -hadoop-mapreduce-client-common-2.6.0.jar -hadoop-mapreduce-client-core-2.6.0.jar -hadoop-mapreduce-client-jobclient-2.6.0.jar -hadoop-mapreduce-client-shuffle-2.6.0.jar -hadoop-yarn-api-2.6.0.jar -hadoop-yarn-client-2.6.0.jar -hadoop-yarn-common-2.6.0.jar -hadoop-yarn-server-common-2.6.0.jar -hadoop-yarn-server-web-proxy-2.6.0.jar +hadoop-annotations-2.6.4.jar +hadoop-auth-2.6.4.jar +hadoop-cl
spark git commit: [SPARK-15818][BUILD] Upgrade to Hadoop 2.7.2
Repository: spark Updated Branches: refs/heads/master 921fa40b1 -> 147c02082 [SPARK-15818][BUILD] Upgrade to Hadoop 2.7.2 ## What changes were proposed in this pull request? Updating the Hadoop version from 2.7.0 to 2.7.2 if we use the Hadoop-2.7 build profile ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Existing tests (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating Hadoop 2.7.0 is not ready for production use https://hadoop.apache.org/docs/r2.7.0/ states "Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.6.0. This release is not yet ready for production use. Production users should use 2.7.1 release and beyond." Hadoop 2.7.1 release notes: "Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building upon the previous release 2.7.0. This is the next stable release after Apache Hadoop 2.6.x." And then Hadoop 2.7.2 release notes: "Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building upon the previous stable release 2.7.1." I've tested this is OK with Intel hardware and IBM Java 8 so let's test it with OpenJDK, ideally this will be pushed to branch-2.0 and master. Author: Adam Roberts Closes #13556 from a-roberts/patch-2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/147c0208 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/147c0208 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/147c0208 Branch: refs/heads/master Commit: 147c020823080c60b495f7950629d8134bf895db Parents: 921fa40 Author: Adam Roberts Authored: Thu Jun 9 10:34:01 2016 +0100 Committer: Sean Owen Committed: Thu Jun 9 10:34:01 2016 +0100 -- dev/deps/spark-deps-hadoop-2.4 | 30 +++--- dev/deps/spark-deps-hadoop-2.6 | 30 +++--- dev/deps/spark-deps-hadoop-2.7 | 30 +++--- pom.xml| 6 +++--- 4 files changed, 48 insertions(+), 48 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/147c0208/dev/deps/spark-deps-hadoop-2.4 -- diff --git a/dev/deps/spark-deps-hadoop-2.4 b/dev/deps/spark-deps-hadoop-2.4 index f0491ec..501bf58 100644 --- a/dev/deps/spark-deps-hadoop-2.4 +++ b/dev/deps/spark-deps-hadoop-2.4 @@ -53,21 +53,21 @@ eigenbase-properties-1.1.5.jar guava-14.0.1.jar guice-3.0.jar guice-servlet-3.0.jar -hadoop-annotations-2.4.0.jar -hadoop-auth-2.4.0.jar -hadoop-client-2.4.0.jar -hadoop-common-2.4.0.jar -hadoop-hdfs-2.4.0.jar -hadoop-mapreduce-client-app-2.4.0.jar -hadoop-mapreduce-client-common-2.4.0.jar -hadoop-mapreduce-client-core-2.4.0.jar -hadoop-mapreduce-client-jobclient-2.4.0.jar -hadoop-mapreduce-client-shuffle-2.4.0.jar -hadoop-yarn-api-2.4.0.jar -hadoop-yarn-client-2.4.0.jar -hadoop-yarn-common-2.4.0.jar -hadoop-yarn-server-common-2.4.0.jar -hadoop-yarn-server-web-proxy-2.4.0.jar +hadoop-annotations-2.4.1.jar +hadoop-auth-2.4.1.jar +hadoop-client-2.4.1.jar +hadoop-common-2.4.1.jar +hadoop-hdfs-2.4.1.jar +hadoop-mapreduce-client-app-2.4.1.jar +hadoop-mapreduce-client-common-2.4.1.jar +hadoop-mapreduce-client-core-2.4.1.jar +hadoop-mapreduce-client-jobclient-2.4.1.jar +hadoop-mapreduce-client-shuffle-2.4.1.jar +hadoop-yarn-api-2.4.1.jar +hadoop-yarn-client-2.4.1.jar +hadoop-yarn-common-2.4.1.jar +hadoop-yarn-server-common-2.4.1.jar +hadoop-yarn-server-web-proxy-2.4.1.jar hk2-api-2.4.0-b34.jar hk2-locator-2.4.0-b34.jar hk2-utils-2.4.0-b34.jar 
http://git-wip-us.apache.org/repos/asf/spark/blob/147c0208/dev/deps/spark-deps-hadoop-2.6 -- diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6 index b3dced6..b915727 100644 --- a/dev/deps/spark-deps-hadoop-2.6 +++ b/dev/deps/spark-deps-hadoop-2.6 @@ -58,21 +58,21 @@ gson-2.2.4.jar guava-14.0.1.jar guice-3.0.jar guice-servlet-3.0.jar -hadoop-annotations-2.6.0.jar -hadoop-auth-2.6.0.jar -hadoop-client-2.6.0.jar -hadoop-common-2.6.0.jar -hadoop-hdfs-2.6.0.jar -hadoop-mapreduce-client-app-2.6.0.jar -hadoop-mapreduce-client-common-2.6.0.jar -hadoop-mapreduce-client-core-2.6.0.jar -hadoop-mapreduce-client-jobclient-2.6.0.jar -hadoop-mapreduce-client-shuffle-2.6.0.jar -hadoop-yarn-api-2.6.0.jar -hadoop-yarn-client-2.6.0.jar -hadoop-yarn-common-2.6.0.jar -hadoop-yarn-server-common-2.6.0.jar -hadoop-yarn-server-web-proxy-2.6.0.jar +hadoop-annotations-2.6.4.jar +hadoop-auth-2.6.4.jar +hadoop-client-2.6.4.jar +hadoop-common-2.6.4.jar +hadoop-hdfs-2.6.4.jar +hadoop-mapreduce-client-app-2.6.4.jar +h
spark git commit: [SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 cache
Repository: spark Updated Branches: refs/heads/branch-1.6 5830828ef -> bb917fc65 [SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 cache This patch fixes a bug in `./dev/test-dependencies.sh` which caused spurious failures when the script was run on a machine with an empty `.m2` cache. The problem was that extra log output from the dependency download was conflicting with the grep / regex used to identify the classpath in the Maven output. This patch fixes this issue by adjusting the regex pattern. Tested manually with the following reproduction of the bug: ``` rm -rf ~/.m2/repository/org/apache/commons/ ./dev/test-dependencies.sh ``` Author: Josh Rosen Closes #13568 from JoshRosen/SPARK-12712. (cherry picked from commit 921fa40b14082bfd1094fa49fb3b0c46a79c1aaa) Signed-off-by: Josh Rosen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bb917fc6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bb917fc6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bb917fc6 Branch: refs/heads/branch-1.6 Commit: bb917fc659ec62718214f2f2fceb03a90515ac3e Parents: 5830828 Author: Josh Rosen Authored: Thu Jun 9 00:51:24 2016 -0700 Committer: Josh Rosen Committed: Thu Jun 9 00:52:43 2016 -0700 -- dev/test-dependencies.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bb917fc6/dev/test-dependencies.sh -- diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh index efb49f7..7367a8b 100755 --- a/dev/test-dependencies.sh +++ b/dev/test-dependencies.sh @@ -84,7 +84,7 @@ for HADOOP_PROFILE in "${HADOOP_PROFILES[@]}"; do echo "Generating dependency manifest for $HADOOP_PROFILE" mkdir -p dev/pr-deps $MVN $HADOOP_MODULE_PROFILES -P$HADOOP_PROFILE dependency:build-classpath -pl assembly \ -| grep "Building Spark Project Assembly" -A 5 \ +| grep "Dependencies classpath:" -A 1 \ | tail -n 1 | tr ":" "\n" | rev | cut -d "/" -f 1 | rev | sort \ | grep -v spark > dev/pr-deps/spark-deps-$HADOOP_PROFILE done - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 cache
Repository: spark Updated Branches: refs/heads/master d5807def1 -> 921fa40b1 [SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 cache This patch fixes a bug in `./dev/test-dependencies.sh` which caused spurious failures when the script was run on a machine with an empty `.m2` cache. The problem was that extra log output from the dependency download was conflicting with the grep / regex used to identify the classpath in the Maven output. This patch fixes this issue by adjusting the regex pattern. Tested manually with the following reproduction of the bug: ``` rm -rf ~/.m2/repository/org/apache/commons/ ./dev/test-dependencies.sh ``` Author: Josh Rosen Closes #13568 from JoshRosen/SPARK-12712. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/921fa40b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/921fa40b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/921fa40b Branch: refs/heads/master Commit: 921fa40b14082bfd1094fa49fb3b0c46a79c1aaa Parents: d5807de Author: Josh Rosen Authored: Thu Jun 9 00:51:24 2016 -0700 Committer: Josh Rosen Committed: Thu Jun 9 00:51:24 2016 -0700 -- dev/test-dependencies.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/921fa40b/dev/test-dependencies.sh -- diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh index 5ea643e..28e3d4d 100755 --- a/dev/test-dependencies.sh +++ b/dev/test-dependencies.sh @@ -79,7 +79,7 @@ for HADOOP_PROFILE in "${HADOOP_PROFILES[@]}"; do echo "Generating dependency manifest for $HADOOP_PROFILE" mkdir -p dev/pr-deps $MVN $HADOOP2_MODULE_PROFILES -P$HADOOP_PROFILE dependency:build-classpath -pl assembly \ -| grep "Building Spark Project Assembly" -A 5 \ +| grep "Dependencies classpath:" -A 1 \ | tail -n 1 | tr ":" "\n" | rev | cut -d "/" -f 1 | rev | sort \ | grep -v spark > dev/pr-deps/spark-deps-$HADOOP_PROFILE done - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 cache
Repository: spark Updated Branches: refs/heads/branch-2.0 96c011d5b -> 8ee93eed9 [SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 cache This patch fixes a bug in `./dev/test-dependencies.sh` which caused spurious failures when the script was run on a machine with an empty `.m2` cache. The problem was that extra log output from the dependency download was conflicting with the grep / regex used to identify the classpath in the Maven output. This patch fixes this issue by adjusting the regex pattern. Tested manually with the following reproduction of the bug: ``` rm -rf ~/.m2/repository/org/apache/commons/ ./dev/test-dependencies.sh ``` Author: Josh Rosen Closes #13568 from JoshRosen/SPARK-12712. (cherry picked from commit 921fa40b14082bfd1094fa49fb3b0c46a79c1aaa) Signed-off-by: Josh Rosen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8ee93eed Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8ee93eed Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8ee93eed Branch: refs/heads/branch-2.0 Commit: 8ee93eed931b185b887882cc77c6fe8ddc907611 Parents: 96c011d Author: Josh Rosen Authored: Thu Jun 9 00:51:24 2016 -0700 Committer: Josh Rosen Committed: Thu Jun 9 00:51:51 2016 -0700 -- dev/test-dependencies.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8ee93eed/dev/test-dependencies.sh -- diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh index 5ea643e..28e3d4d 100755 --- a/dev/test-dependencies.sh +++ b/dev/test-dependencies.sh @@ -79,7 +79,7 @@ for HADOOP_PROFILE in "${HADOOP_PROFILES[@]}"; do echo "Generating dependency manifest for $HADOOP_PROFILE" mkdir -p dev/pr-deps $MVN $HADOOP2_MODULE_PROFILES -P$HADOOP_PROFILE dependency:build-classpath -pl assembly \ -| grep "Building Spark Project Assembly" -A 5 \ +| grep "Dependencies classpath:" -A 1 \ | tail -n 1 | tr ":" "\n" | rev | cut -d "/" -f 1 | rev | sort \ | grep -v spark > dev/pr-deps/spark-deps-$HADOOP_PROFILE done - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org