spark git commit: [SPARK-24934][SQL] Explicitly whitelist supported types in upper/lower bounds for in-memory partition pruning
Repository: spark Updated Branches: refs/heads/branch-2.3 bad56bb7b -> aa51c070f [SPARK-24934][SQL] Explicitly whitelist supported types in upper/lower bounds for in-memory partition pruning ## What changes were proposed in this pull request? Looks we intentionally set `null` for upper/lower bounds for complex types and don't use it. However, these look used in in-memory partition pruning, which ends up with incorrect results. This PR proposes to explicitly whitelist the supported types. ```scala val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") df.cache().filter("arrayCol > array('a', 'b')").show() ``` ```scala val df = sql("select cast('a' as binary) as a") df.cache().filter("a == cast('a' as binary)").show() ``` **Before:** ``` ++ |arrayCol| ++ ++ ``` ``` +---+ | a| +---+ +---+ ``` **After:** ``` ++ |arrayCol| ++ | [c, d]| ++ ``` ``` ++ | a| ++ |[61]| ++ ``` ## How was this patch tested? Unit tests were added and manually tested. Author: hyukjinkwon Closes #21882 from HyukjinKwon/stats-filter. (cherry picked from commit bfe60fcdb49aa48534060c38e36e06119900140d) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aa51c070 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aa51c070 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aa51c070 Branch: refs/heads/branch-2.3 Commit: aa51c070f8944fd2aa94ac891b45ff51ffcc1ef2 Parents: bad56bb Author: hyukjinkwon Authored: Mon Jul 30 13:20:03 2018 +0800 Committer: Wenchen Fan Committed: Mon Jul 30 13:20:31 2018 +0800 -- .../columnar/InMemoryTableScanExec.scala| 42 ++-- .../columnar/PartitionBatchPruningSuite.scala | 30 +- 2 files changed, 58 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/aa51c070/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala index 08b2751..7bed7e3 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala @@ -183,6 +183,18 @@ case class InMemoryTableScanExec( private val stats = relation.partitionStatistics private def statsFor(a: Attribute) = stats.forAttribute(a) + // Currently, only use statistics from atomic types except binary type only. + private object ExtractableLiteral { +def unapply(expr: Expression): Option[Literal] = expr match { + case lit: Literal => lit.dataType match { +case BinaryType => None +case _: AtomicType => Some(lit) +case _ => None + } + case _ => None +} + } + // Returned filter predicate should return false iff it is impossible for the input expression // to evaluate to `true' based on statistics collected about this partition batch. 
@transient lazy val buildFilter: PartialFunction[Expression, Expression] = { @@ -194,33 +206,37 @@ case class InMemoryTableScanExec( if buildFilter.isDefinedAt(lhs) && buildFilter.isDefinedAt(rhs) => buildFilter(lhs) || buildFilter(rhs) -case EqualTo(a: AttributeReference, l: Literal) => +case EqualTo(a: AttributeReference, ExtractableLiteral(l)) => statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound -case EqualTo(l: Literal, a: AttributeReference) => +case EqualTo(ExtractableLiteral(l), a: AttributeReference) => statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound -case EqualNullSafe(a: AttributeReference, l: Literal) => +case EqualNullSafe(a: AttributeReference, ExtractableLiteral(l)) => statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound -case EqualNullSafe(l: Literal, a: AttributeReference) => +case EqualNullSafe(ExtractableLiteral(l), a: AttributeReference) => statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound -case LessThan(a: AttributeReference, l: Literal) => statsFor(a).lowerBound < l -case LessThan(l: Literal, a: AttributeReference) => l < statsFor(a).upperBound +case LessThan(a: AttributeReference, ExtractableLiteral(l)) => statsFor(a).lowerBound < l +case LessThan(ExtractableLiteral(l), a: AttributeReference) => l < statsFor(a).upperBound -case LessThanOrEqual(a: AttributeReference, l: Literal) => statsFor(a).lowerBound <= l -case LessThanOrEqual(l: Literal, a: AttributeReference) => l <= statsFor
spark git commit: [SPARK-24934][SQL] Explicitly whitelist supported types in upper/lower bounds for in-memory partition pruning
Repository: spark Updated Branches: refs/heads/master 65a4bc143 -> bfe60fcdb [SPARK-24934][SQL] Explicitly whitelist supported types in upper/lower bounds for in-memory partition pruning ## What changes were proposed in this pull request? Looks we intentionally set `null` for upper/lower bounds for complex types and don't use it. However, these look used in in-memory partition pruning, which ends up with incorrect results. This PR proposes to explicitly whitelist the supported types. ```scala val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") df.cache().filter("arrayCol > array('a', 'b')").show() ``` ```scala val df = sql("select cast('a' as binary) as a") df.cache().filter("a == cast('a' as binary)").show() ``` **Before:** ``` ++ |arrayCol| ++ ++ ``` ``` +---+ | a| +---+ +---+ ``` **After:** ``` ++ |arrayCol| ++ | [c, d]| ++ ``` ``` ++ | a| ++ |[61]| ++ ``` ## How was this patch tested? Unit tests were added and manually tested. Author: hyukjinkwon Closes #21882 from HyukjinKwon/stats-filter. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bfe60fcd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bfe60fcd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bfe60fcd Branch: refs/heads/master Commit: bfe60fcdb49aa48534060c38e36e06119900140d Parents: 65a4bc1 Author: hyukjinkwon Authored: Mon Jul 30 13:20:03 2018 +0800 Committer: Wenchen Fan Committed: Mon Jul 30 13:20:03 2018 +0800 -- .../columnar/InMemoryTableScanExec.scala| 42 ++-- .../columnar/PartitionBatchPruningSuite.scala | 30 +- 2 files changed, 58 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bfe60fcd/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala index 997cf92..6012aba 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala @@ -183,6 +183,18 @@ case class InMemoryTableScanExec( private val stats = relation.partitionStatistics private def statsFor(a: Attribute) = stats.forAttribute(a) + // Currently, only use statistics from atomic types except binary type only. + private object ExtractableLiteral { +def unapply(expr: Expression): Option[Literal] = expr match { + case lit: Literal => lit.dataType match { +case BinaryType => None +case _: AtomicType => Some(lit) +case _ => None + } + case _ => None +} + } + // Returned filter predicate should return false iff it is impossible for the input expression // to evaluate to `true' based on statistics collected about this partition batch. 
@transient lazy val buildFilter: PartialFunction[Expression, Expression] = { @@ -194,33 +206,37 @@ case class InMemoryTableScanExec( if buildFilter.isDefinedAt(lhs) && buildFilter.isDefinedAt(rhs) => buildFilter(lhs) || buildFilter(rhs) -case EqualTo(a: AttributeReference, l: Literal) => +case EqualTo(a: AttributeReference, ExtractableLiteral(l)) => statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound -case EqualTo(l: Literal, a: AttributeReference) => +case EqualTo(ExtractableLiteral(l), a: AttributeReference) => statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound -case EqualNullSafe(a: AttributeReference, l: Literal) => +case EqualNullSafe(a: AttributeReference, ExtractableLiteral(l)) => statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound -case EqualNullSafe(l: Literal, a: AttributeReference) => +case EqualNullSafe(ExtractableLiteral(l), a: AttributeReference) => statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound -case LessThan(a: AttributeReference, l: Literal) => statsFor(a).lowerBound < l -case LessThan(l: Literal, a: AttributeReference) => l < statsFor(a).upperBound +case LessThan(a: AttributeReference, ExtractableLiteral(l)) => statsFor(a).lowerBound < l +case LessThan(ExtractableLiteral(l), a: AttributeReference) => l < statsFor(a).upperBound -case LessThanOrEqual(a: AttributeReference, l: Literal) => statsFor(a).lowerBound <= l -case LessThanOrEqual(l: Literal, a: AttributeReference) => l <= statsFor(a).upperBound +case LessThanOrEqual(a: AttributeReference, ExtractableLiteral(l)) => + statsFor(
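For readers less familiar with Scala extractor objects, the patch above can be read as follows: wrapping the literal in `ExtractableLiteral(...)` makes each pruning rule stop matching whenever the literal's type has no trustworthy upper/lower bounds, so the batch is conservatively kept instead of being pruned against `null` bounds. The self-contained sketch below illustrates that pattern; every name in it (`DType`, `Expr`, `Lit`, `ExtractableLit`, `pruningPredicate`) is a simplified stand-in invented for this illustration, not one of Spark's catalyst classes.

```scala
trait AtomicLike                                      // stand-in for Spark's AtomicType
sealed trait DType
case object IntDT    extends DType with AtomicLike
case object BinaryDT extends DType with AtomicLike    // atomic, but its min/max bounds are not usable
case object ArrayDT  extends DType                    // complex type: no bounds collected at all

sealed trait Expr
case class Lit(value: Any, dataType: DType) extends Expr
case class Attr(name: String)               extends Expr
case class EqTo(left: Expr, right: Expr)    extends Expr

// Succeeds only for literals whose type has usable partition statistics.
object ExtractableLit {
  def unapply(e: Expr): Option[Lit] = e match {
    case l: Lit => l.dataType match {
      case BinaryDT      => None
      case _: AtomicLike => Some(l)
      case _             => None
    }
    case _ => None
  }
}

// A pruning rule fires only when the literal is extractable; otherwise the batch is kept.
def pruningPredicate(p: Expr): Option[String] = p match {
  case EqTo(Attr(a), ExtractableLit(l)) =>
    Some(s"lowerBound($a) <= ${l.value} && ${l.value} <= upperBound($a)")
  case _ => None
}

println(pruningPredicate(EqTo(Attr("i"), Lit(3, IntDT))))            // Some(...) -> prune with stats
println(pruningPredicate(EqTo(Attr("arr"), Lit(Seq("a"), ArrayDT)))) // None -> conservatively keep batch
```

The design choice in the patch mirrors this sketch: rather than teaching every comparison rule about unsupported types, a single extractor whitelists the supported literal types in one place.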
svn commit: r28425 - in /dev/spark/2.3.3-SNAPSHOT-2018_07_29_22_01-bad56bb-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Mon Jul 30 05:15:24 2018 New Revision: 28425 Log: Apache Spark 2.3.3-SNAPSHOT-2018_07_29_22_01-bad56bb docs [This commit notification would consist of 1443 parts, which exceeds the limit of 50 parts, so it was shortened to this summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-21274][SQL] Implement INTERSECT ALL clause
Repository: spark Updated Branches: refs/heads/master 6690924c4 -> 65a4bc143 [SPARK-21274][SQL] Implement INTERSECT ALL clause ## What changes were proposed in this pull request? Implements INTERSECT ALL clause through query rewrites using existing operators in Spark. Please refer to [Link](https://drive.google.com/open?id=1nyW0T0b_ajUduQoPgZLAsyHK8s3_dko3ulQuxaLpUXE) for the design. Input Query ``` SQL SELECT c1 FROM ut1 INTERSECT ALL SELECT c1 FROM ut2 ``` Rewritten Query ```SQL SELECT c1 FROM ( SELECT replicate_row(min_count, c1) FROM ( SELECT c1, IF (vcol1_cnt > vcol2_cnt, vcol2_cnt, vcol1_cnt) AS min_count FROM ( SELECT c1, count(vcol1) as vcol1_cnt, count(vcol2) as vcol2_cnt FROM ( SELECT c1, true as vcol1, null as vcol2 FROM ut1 UNION ALL SELECT c1, null as vcol1, true as vcol2 FROM ut2 ) AS union_all GROUP BY c1 HAVING vcol1_cnt >= 1 AND vcol2_cnt >= 1 ) ) ) ``` ## How was this patch tested? Added test cases in SQLQueryTestSuite, DataFrameSuite, SetOperationSuite Author: Dilip Biswal Closes #21886 from dilipbiswal/dkb_intersect_all_final. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/65a4bc14 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/65a4bc14 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/65a4bc14 Branch: refs/heads/master Commit: 65a4bc143ab5dc2ced589dc107bbafa8a7290931 Parents: 6690924 Author: Dilip Biswal Authored: Sun Jul 29 22:11:01 2018 -0700 Committer: Xiao Li Committed: Sun Jul 29 22:11:01 2018 -0700 -- python/pyspark/sql/dataframe.py | 22 ++ .../spark/sql/catalyst/analysis/Analyzer.scala | 2 +- .../sql/catalyst/analysis/TypeCoercion.scala| 4 +- .../analysis/UnsupportedOperationChecker.scala | 2 +- .../sql/catalyst/optimizer/Optimizer.scala | 81 ++- .../spark/sql/catalyst/parser/AstBuilder.scala | 2 +- .../plans/logical/basicLogicalOperators.scala | 7 +- .../catalyst/optimizer/SetOperationSuite.scala | 32 ++- .../sql/catalyst/parser/PlanParserSuite.scala | 1 - .../scala/org/apache/spark/sql/Dataset.scala| 19 +- .../spark/sql/execution/SparkStrategies.scala | 8 +- .../sql-tests/inputs/intersect-all.sql | 123 ++ .../sql-tests/results/intersect-all.sql.out | 241 +++ .../org/apache/spark/sql/DataFrameSuite.scala | 54 + .../org/apache/spark/sql/test/SQLTestData.scala | 13 + 15 files changed, 599 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/65a4bc14/python/pyspark/sql/dataframe.py -- diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py index b2e0a5b..07fb260 100644 --- a/python/pyspark/sql/dataframe.py +++ b/python/pyspark/sql/dataframe.py @@ -1500,6 +1500,28 @@ class DataFrame(object): """ return DataFrame(self._jdf.intersect(other._jdf), self.sql_ctx) +@since(2.4) +def intersectAll(self, other): +""" Return a new :class:`DataFrame` containing rows in both this dataframe and other +dataframe while preserving duplicates. + +This is equivalent to `INTERSECT ALL` in SQL. +>>> df1 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3), ("c", 4)], ["C1", "C2"]) +>>> df2 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"]) + +>>> df1.intersectAll(df2).sort("C1", "C2").show() ++---+---+ +| C1| C2| ++---+---+ +| a| 1| +| a| 1| +| b| 3| ++---+---+ + +Also as standard in SQL, this function resolves columns by position (not by name). 
+""" +return DataFrame(self._jdf.intersectAll(other._jdf), self.sql_ctx) + @since(1.3) def subtract(self, other): """ Return a new :class:`DataFrame` containing rows in this frame http://git-wip-us.apache.org/repos/asf/spark/blob/65a4bc14/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index 8abb1c7..9965cd6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -914,7 +914,7 @@ cl
svn commit: r28421 - in /dev/spark/2.4.0-SNAPSHOT-2018_07_29_20_02-6690924-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Mon Jul 30 03:16:25 2018 New Revision: 28421 Log: Apache Spark 2.4.0-SNAPSHOT-2018_07_29_20_02-6690924 docs [This commit notification would consist of 1470 parts, which exceeds the limit of 50 parts, so it was shortened to this summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [MINOR] Avoid the 'latest' link that might vary per release in functions.scala's comment
Repository: spark Updated Branches: refs/heads/master 3210121fe -> 6690924c4 [MINOR] Avoid the 'latest' link that might vary per release in functions.scala's comment ## What changes were proposed in this pull request? This PR propose to address https://github.com/apache/spark/pull/21318#discussion_r187843125 comment. This is rather a nit but looks we better avoid to update the link for each release since it always points the latest (it doesn't look like worth enough updating release guide on the other hand as well). ## How was this patch tested? N/A Author: hyukjinkwon Closes #21907 from HyukjinKwon/minor-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6690924c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6690924c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6690924c Branch: refs/heads/master Commit: 6690924c49a443cd629fcc1a4460cf443fb0a918 Parents: 3210121 Author: hyukjinkwon Authored: Mon Jul 30 10:02:29 2018 +0800 Committer: hyukjinkwon Committed: Mon Jul 30 10:02:29 2018 +0800 -- sql/core/src/main/scala/org/apache/spark/sql/functions.scala | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6690924c/sql/core/src/main/scala/org/apache/spark/sql/functions.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala index 2772958..a2d3792 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala @@ -44,8 +44,8 @@ import org.apache.spark.util.Utils * * Spark also includes more built-in functions that are less common and are not defined here. * You can still access them (and all the functions defined here) using the `functions.expr()` API - * and calling them through a SQL expression string. You can find the entire list of functions for - * the latest version of Spark at https://spark.apache.org/docs/latest/api/sql/index.html. + * and calling them through a SQL expression string. You can find the entire list of functions + * at SQL API documentation. * * As an example, `isnan` is a function that is defined here. You can use `isnan(col("myCol"))` * to invoke the `isnan` function. This way the programming language's compiler ensures `isnan` - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [MINOR][BUILD] Remove -Phive-thriftserver profile within appveyor.yml
Repository: spark Updated Branches: refs/heads/master 3695ba577 -> 3210121fe [MINOR][BUILD] Remove -Phive-thriftserver profile within appveyor.yml ## What changes were proposed in this pull request? This PR propose to remove `-Phive-thriftserver` profile which seems not affecting the SparkR tests in AppVeyor. Originally wanted to check if there's a meaningful build time decrease but seems not. It will have but seems not meaningfully decreased. ## How was this patch tested? AppVeyor tests: ``` [00:40:49] Attaching package: 'SparkR' [00:40:49] [00:40:49] The following objects are masked from 'package:testthat': [00:40:49] [00:40:49] describe, not [00:40:49] [00:40:49] The following objects are masked from 'package:stats': [00:40:49] [00:40:49] cov, filter, lag, na.omit, predict, sd, var, window [00:40:49] [00:40:49] The following objects are masked from 'package:base': [00:40:49] [00:40:49] as.data.frame, colnames, colnames<-, drop, endsWith, intersect, [00:40:49] rank, rbind, sample, startsWith, subset, summary, transform, union [00:40:49] [00:40:49] Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:41:43] basic tests for CRAN: . [00:41:43] [00:41:43] DONE === [00:41:43] binary functions: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:05] ... [00:42:05] functions on binary files: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:10] [00:42:10] broadcast variables: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:12] .. [00:42:12] functions in client.R: . [00:42:30] test functions in sparkR.R: .. [00:42:30] include R packages: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:31] [00:42:31] JVM API: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:31] .. [00:42:31] MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:48:48] .. [00:48:48] MLlib clustering algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:50:12] . [00:50:12] MLlib frequent pattern mining: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:50:18] . [00:50:18] MLlib recommendation algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:50:27] [00:50:27] MLlib regression algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:56:00] [00:56:00] MLlib statistics algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:56:04] [00:56:04] MLlib tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:58:20] .. [00:58:20] parallelize() and collect(): Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:58:20] . [00:58:20] basic RDD functions: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:03:35] [01:03:35] SerDe functionality: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:03:39] ... [01:03:39] partitionBy, groupByKey, reduceByKey etc.: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:04:20] [01:04:20] functions in sparkR.R: [01:04:20] SparkSQL functions: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:04:50] -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:04:50] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... [01:04:50] -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:04:50] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... 
[01:04:51] -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:04:51] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... [01:06:13] ...
spark git commit: [MINOR][CORE][TEST] Fix afterEach() in TaskSetManagerSuite and TaskSchedulerImplSuite
Repository: spark Updated Branches: refs/heads/branch-2.3 71eb7d468 -> bad56bb7b [MINOR][CORE][TEST] Fix afterEach() in TastSetManagerSuite and TaskSchedulerImplSuite ## What changes were proposed in this pull request? In the `afterEach()` method of both `TastSetManagerSuite` and `TaskSchedulerImplSuite`, `super.afterEach()` shall be called at the end, because it shall stop the SparkContext. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93706/testReport/org.apache.spark.scheduler/TaskSchedulerImplSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ The test failure is caused by the above reason, the newly added `barrierCoordinator` required `rpcEnv` which has been stopped before `TaskSchedulerImpl` doing cleanup. ## How was this patch tested? Existing tests. Author: Xingbo Jiang Closes #21908 from jiangxb1987/afterEach. (cherry picked from commit 3695ba57731a669ed20e7f676edee602c292fbed) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bad56bb7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bad56bb7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bad56bb7 Branch: refs/heads/branch-2.3 Commit: bad56bb7b2340d338eac8cea07e9f1bb3e08b1ac Parents: 71eb7d4 Author: Xingbo Jiang Authored: Mon Jul 30 09:58:28 2018 +0800 Committer: hyukjinkwon Committed: Mon Jul 30 09:58:54 2018 +0800 -- .../scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala | 2 +- .../scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bad56bb7/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala index 33f2ea1..a9e0aed 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala @@ -62,7 +62,6 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B } override def afterEach(): Unit = { -super.afterEach() if (taskScheduler != null) { taskScheduler.stop() taskScheduler = null @@ -71,6 +70,7 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B dagScheduler.stop() dagScheduler = null } +super.afterEach() } def setupScheduler(confs: (String, String)*): TaskSchedulerImpl = { http://git-wip-us.apache.org/repos/asf/spark/blob/bad56bb7/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala index b4acccf..d75c245 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala @@ -178,12 +178,12 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg } override def afterEach(): Unit = { -super.afterEach() if (sched != null) { sched.dagScheduler.stop() sched.stop() sched = null } +super.afterEach() } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [MINOR][CORE][TEST] Fix afterEach() in TaskSetManagerSuite and TaskSchedulerImplSuite
Repository: spark Updated Branches: refs/heads/branch-2.2 f52d0c451 -> c4b37696f [MINOR][CORE][TEST] Fix afterEach() in TastSetManagerSuite and TaskSchedulerImplSuite ## What changes were proposed in this pull request? In the `afterEach()` method of both `TastSetManagerSuite` and `TaskSchedulerImplSuite`, `super.afterEach()` shall be called at the end, because it shall stop the SparkContext. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93706/testReport/org.apache.spark.scheduler/TaskSchedulerImplSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ The test failure is caused by the above reason, the newly added `barrierCoordinator` required `rpcEnv` which has been stopped before `TaskSchedulerImpl` doing cleanup. ## How was this patch tested? Existing tests. Author: Xingbo Jiang Closes #21908 from jiangxb1987/afterEach. (cherry picked from commit 3695ba57731a669ed20e7f676edee602c292fbed) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c4b37696 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c4b37696 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c4b37696 Branch: refs/heads/branch-2.2 Commit: c4b37696f9551a6429c1d90e53f2499aada556b1 Parents: f52d0c4 Author: Xingbo Jiang Authored: Mon Jul 30 09:58:28 2018 +0800 Committer: hyukjinkwon Committed: Mon Jul 30 09:59:15 2018 +0800 -- .../scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala | 2 +- .../scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c4b37696/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala index 38a4f40..2da9e17 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala @@ -62,7 +62,6 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B } override def afterEach(): Unit = { -super.afterEach() if (taskScheduler != null) { taskScheduler.stop() taskScheduler = null @@ -71,6 +70,7 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B dagScheduler.stop() dagScheduler = null } +super.afterEach() } def setupScheduler(confs: (String, String)*): TaskSchedulerImpl = { http://git-wip-us.apache.org/repos/asf/spark/blob/c4b37696/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala index 904f0b6..4d330b5 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala @@ -172,12 +172,12 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg } override def afterEach(): Unit = { -super.afterEach() if (sched != null) { sched.dagScheduler.stop() sched.stop() sched = null } +super.afterEach() } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [MINOR][CORE][TEST] Fix afterEach() in TaskSetManagerSuite and TaskSchedulerImplSuite
Repository: spark Updated Branches: refs/heads/master 2c54aae1b -> 3695ba577 [MINOR][CORE][TEST] Fix afterEach() in TastSetManagerSuite and TaskSchedulerImplSuite ## What changes were proposed in this pull request? In the `afterEach()` method of both `TastSetManagerSuite` and `TaskSchedulerImplSuite`, `super.afterEach()` shall be called at the end, because it shall stop the SparkContext. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93706/testReport/org.apache.spark.scheduler/TaskSchedulerImplSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ The test failure is caused by the above reason, the newly added `barrierCoordinator` required `rpcEnv` which has been stopped before `TaskSchedulerImpl` doing cleanup. ## How was this patch tested? Existing tests. Author: Xingbo Jiang Closes #21908 from jiangxb1987/afterEach. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3695ba57 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3695ba57 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3695ba57 Branch: refs/heads/master Commit: 3695ba57731a669ed20e7f676edee602c292fbed Parents: 2c54aae Author: Xingbo Jiang Authored: Mon Jul 30 09:58:28 2018 +0800 Committer: hyukjinkwon Committed: Mon Jul 30 09:58:28 2018 +0800 -- .../scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala | 2 +- .../scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3695ba57/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala index 624384a..16c273b 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala @@ -62,7 +62,6 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B } override def afterEach(): Unit = { -super.afterEach() if (taskScheduler != null) { taskScheduler.stop() taskScheduler = null @@ -71,6 +70,7 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B dagScheduler.stop() dagScheduler = null } +super.afterEach() } def setupScheduler(confs: (String, String)*): TaskSchedulerImpl = { http://git-wip-us.apache.org/repos/asf/spark/blob/3695ba57/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala index cf05434..d264ada 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala @@ -178,12 +178,12 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg } override def afterEach(): Unit = { -super.afterEach() if (sched != null) { sched.dagScheduler.stop() sched.stop() sched = null } +super.afterEach() } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
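The fix is purely about ordering: suite-local resources (the scheduler, the DAG scheduler) must be stopped while the shared SparkContext and RpcEnv are still alive, and only then should `super.afterEach()` tear the context down. Below is a generic ScalaTest sketch of that ordering; the trait and field names are hypothetical, and the `try`/`finally` wrapper is an extra safeguard that the actual patch does not add (it simply moves the `super.afterEach()` call to the end).

```scala
import org.scalatest.{BeforeAndAfterEach, Suite}

trait LocalResourceCleanup extends BeforeAndAfterEach { self: Suite =>
  // Hypothetical suite-local resource that still needs the shared context to shut down cleanly.
  protected var scheduler: AutoCloseable = _

  override def afterEach(): Unit = {
    try {
      if (scheduler != null) {
        scheduler.close()   // may still use the shared SparkContext / RpcEnv
        scheduler = null
      }
    } finally {
      super.afterEach()     // the parent trait stops the shared context last
    }
  }
}
```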
svn commit: r28420 - in /dev/spark/2.4.0-SNAPSHOT-2018_07_29_16_02-2c54aae-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Sun Jul 29 23:16:21 2018 New Revision: 28420 Log: Apache Spark 2.4.0-SNAPSHOT-2018_07_29_16_02-2c54aae docs [This commit notification would consist of 1470 parts, which exceeds the limit of 50 parts, so it was shortened to this summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r28419 - in /dev/spark/2.3.3-SNAPSHOT-2018_07_29_14_01-71eb7d4-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Sun Jul 29 21:15:26 2018 New Revision: 28419 Log: Apache Spark 2.3.3-SNAPSHOT-2018_07_29_14_01-71eb7d4 docs [This commit notification would consist of 1443 parts, which exceeds the limit of 50 parts, so it was shortened to this summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in data error
Repository: spark Updated Branches: refs/heads/branch-2.1 7d50fec3f -> a3eb07db3 [SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in data error When join key is long or int in broadcast join, Spark will use `LongToUnsafeRowMap` to store key-values of the table witch will be broadcasted. But, when `LongToUnsafeRowMap` is broadcasted to executors, and it is too big to hold in memory, it will be stored in disk. At that time, because `write` uses a variable `cursor` to determine how many bytes in `page` of `LongToUnsafeRowMap` will be write out and the `cursor` was not restore when deserializing, executor will write out nothing from page into disk. ## What changes were proposed in this pull request? Restore cursor value when deserializing. Author: liulijia Closes #21772 from liutang123/SPARK-24809. (cherry picked from commit 2c54aae1bc2fa3da26917c89e6201fb2108d9fab) Signed-off-by: Xiao Li Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a3eb07db Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a3eb07db Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a3eb07db Branch: refs/heads/branch-2.1 Commit: a3eb07db3be80be663ca66f1e9a11fcef8ab6c20 Parents: 7d50fec Author: liulijia Authored: Sun Jul 29 13:13:00 2018 -0700 Committer: Xiao Li Committed: Sun Jul 29 13:14:57 2018 -0700 -- .../sql/execution/joins/HashedRelation.scala| 2 ++ .../execution/joins/HashedRelationSuite.scala | 29 2 files changed, 31 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a3eb07db/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala index f7e8ea6..206afcd 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala @@ -741,6 +741,8 @@ private[execution] final class LongToUnsafeRowMap(val mm: TaskMemoryManager, cap array = readLongArray(readBuffer, length) val pageLength = readLong().toInt page = readLongArray(readBuffer, pageLength) +// Restore cursor variable to make this map able to be serialized again on executors. 
+cursor = pageLength * 8 + Platform.LONG_ARRAY_OFFSET } override def readExternal(in: ObjectInput): Unit = { http://git-wip-us.apache.org/repos/asf/spark/blob/a3eb07db/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala index f0288c8..9c9e9dc 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala @@ -277,6 +277,35 @@ class HashedRelationSuite extends SparkFunSuite with SharedSQLContext { map.free() } + test("SPARK-24809: Serializing LongToUnsafeRowMap in executor may result in data error") { +val unsafeProj = UnsafeProjection.create(Array[DataType](LongType)) +val originalMap = new LongToUnsafeRowMap(mm, 1) + +val key1 = 1L +val value1 = 4852306286022334418L + +val key2 = 2L +val value2 = 8813607448788216010L + +originalMap.append(key1, unsafeProj(InternalRow(value1))) +originalMap.append(key2, unsafeProj(InternalRow(value2))) +originalMap.optimize() + +val ser = sparkContext.env.serializer.newInstance() +// Simulate serialize/deserialize twice on driver and executor +val firstTimeSerialized = ser.deserialize[LongToUnsafeRowMap](ser.serialize(originalMap)) +val secondTimeSerialized = + ser.deserialize[LongToUnsafeRowMap](ser.serialize(firstTimeSerialized)) + +val resultRow = new UnsafeRow(1) +assert(secondTimeSerialized.getValue(key1, resultRow).getLong(0) === value1) +assert(secondTimeSerialized.getValue(key2, resultRow).getLong(0) === value2) + +originalMap.free() +firstTimeSerialized.free() +secondTimeSerialized.free() + } + test("Spark-14521") { val ser = new KryoSerializer( (new SparkConf).set("spark.kryo.referenceTracking", "false")).newInstance() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.
spark git commit: [SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in data error
Repository: spark Updated Branches: refs/heads/branch-2.3 d5f340f27 -> 71eb7d468 [SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in data error When join key is long or int in broadcast join, Spark will use `LongToUnsafeRowMap` to store key-values of the table witch will be broadcasted. But, when `LongToUnsafeRowMap` is broadcasted to executors, and it is too big to hold in memory, it will be stored in disk. At that time, because `write` uses a variable `cursor` to determine how many bytes in `page` of `LongToUnsafeRowMap` will be write out and the `cursor` was not restore when deserializing, executor will write out nothing from page into disk. ## What changes were proposed in this pull request? Restore cursor value when deserializing. Author: liulijia Closes #21772 from liutang123/SPARK-24809. (cherry picked from commit 2c54aae1bc2fa3da26917c89e6201fb2108d9fab) Signed-off-by: Xiao Li Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/71eb7d46 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/71eb7d46 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/71eb7d46 Branch: refs/heads/branch-2.3 Commit: 71eb7d4682a7e85e4de580ffe110da961d84817f Parents: d5f340f Author: liulijia Authored: Sun Jul 29 13:13:00 2018 -0700 Committer: Xiao Li Committed: Sun Jul 29 13:13:22 2018 -0700 -- .../sql/execution/joins/HashedRelation.scala| 2 ++ .../execution/joins/HashedRelationSuite.scala | 29 2 files changed, 31 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/71eb7d46/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala index 20ce01f..86eb47a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala @@ -772,6 +772,8 @@ private[execution] final class LongToUnsafeRowMap(val mm: TaskMemoryManager, cap array = readLongArray(readBuffer, length) val pageLength = readLong().toInt page = readLongArray(readBuffer, pageLength) +// Restore cursor variable to make this map able to be serialized again on executors. 
+cursor = pageLength * 8 + Platform.LONG_ARRAY_OFFSET } override def readExternal(in: ObjectInput): Unit = { http://git-wip-us.apache.org/repos/asf/spark/blob/71eb7d46/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala index 037cc2e..d9b34dc 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala @@ -278,6 +278,35 @@ class HashedRelationSuite extends SparkFunSuite with SharedSQLContext { map.free() } + test("SPARK-24809: Serializing LongToUnsafeRowMap in executor may result in data error") { +val unsafeProj = UnsafeProjection.create(Array[DataType](LongType)) +val originalMap = new LongToUnsafeRowMap(mm, 1) + +val key1 = 1L +val value1 = 4852306286022334418L + +val key2 = 2L +val value2 = 8813607448788216010L + +originalMap.append(key1, unsafeProj(InternalRow(value1))) +originalMap.append(key2, unsafeProj(InternalRow(value2))) +originalMap.optimize() + +val ser = sparkContext.env.serializer.newInstance() +// Simulate serialize/deserialize twice on driver and executor +val firstTimeSerialized = ser.deserialize[LongToUnsafeRowMap](ser.serialize(originalMap)) +val secondTimeSerialized = + ser.deserialize[LongToUnsafeRowMap](ser.serialize(firstTimeSerialized)) + +val resultRow = new UnsafeRow(1) +assert(secondTimeSerialized.getValue(key1, resultRow).getLong(0) === value1) +assert(secondTimeSerialized.getValue(key2, resultRow).getLong(0) === value2) + +originalMap.free() +firstTimeSerialized.free() +secondTimeSerialized.free() + } + test("Spark-14521") { val ser = new KryoSerializer( (new SparkConf).set("spark.kryo.referenceTracking", "false")).newInstance() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.
spark git commit: [SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in data error
Repository: spark Updated Branches: refs/heads/branch-2.2 73764737d -> f52d0c451 [SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in data error When join key is long or int in broadcast join, Spark will use `LongToUnsafeRowMap` to store key-values of the table witch will be broadcasted. But, when `LongToUnsafeRowMap` is broadcasted to executors, and it is too big to hold in memory, it will be stored in disk. At that time, because `write` uses a variable `cursor` to determine how many bytes in `page` of `LongToUnsafeRowMap` will be write out and the `cursor` was not restore when deserializing, executor will write out nothing from page into disk. ## What changes were proposed in this pull request? Restore cursor value when deserializing. Author: liulijia Closes #21772 from liutang123/SPARK-24809. (cherry picked from commit 2c54aae1bc2fa3da26917c89e6201fb2108d9fab) Signed-off-by: Xiao Li Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f52d0c45 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f52d0c45 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f52d0c45 Branch: refs/heads/branch-2.2 Commit: f52d0c4515f3f0ceaea03c661fb7739c70c25236 Parents: 7376473 Author: liulijia Authored: Sun Jul 29 13:13:00 2018 -0700 Committer: Xiao Li Committed: Sun Jul 29 13:13:57 2018 -0700 -- .../sql/execution/joins/HashedRelation.scala| 2 ++ .../execution/joins/HashedRelationSuite.scala | 29 2 files changed, 31 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f52d0c45/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala index 07ee3d0..78190bf 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala @@ -741,6 +741,8 @@ private[execution] final class LongToUnsafeRowMap(val mm: TaskMemoryManager, cap array = readLongArray(readBuffer, length) val pageLength = readLong().toInt page = readLongArray(readBuffer, pageLength) +// Restore cursor variable to make this map able to be serialized again on executors. 
+cursor = pageLength * 8 + Platform.LONG_ARRAY_OFFSET } override def readExternal(in: ObjectInput): Unit = { http://git-wip-us.apache.org/repos/asf/spark/blob/f52d0c45/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala index f0288c8..9c9e9dc 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala @@ -277,6 +277,35 @@ class HashedRelationSuite extends SparkFunSuite with SharedSQLContext { map.free() } + test("SPARK-24809: Serializing LongToUnsafeRowMap in executor may result in data error") { +val unsafeProj = UnsafeProjection.create(Array[DataType](LongType)) +val originalMap = new LongToUnsafeRowMap(mm, 1) + +val key1 = 1L +val value1 = 4852306286022334418L + +val key2 = 2L +val value2 = 8813607448788216010L + +originalMap.append(key1, unsafeProj(InternalRow(value1))) +originalMap.append(key2, unsafeProj(InternalRow(value2))) +originalMap.optimize() + +val ser = sparkContext.env.serializer.newInstance() +// Simulate serialize/deserialize twice on driver and executor +val firstTimeSerialized = ser.deserialize[LongToUnsafeRowMap](ser.serialize(originalMap)) +val secondTimeSerialized = + ser.deserialize[LongToUnsafeRowMap](ser.serialize(firstTimeSerialized)) + +val resultRow = new UnsafeRow(1) +assert(secondTimeSerialized.getValue(key1, resultRow).getLong(0) === value1) +assert(secondTimeSerialized.getValue(key2, resultRow).getLong(0) === value2) + +originalMap.free() +firstTimeSerialized.free() +secondTimeSerialized.free() + } + test("Spark-14521") { val ser = new KryoSerializer( (new SparkConf).set("spark.kryo.referenceTracking", "false")).newInstance() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.
spark git commit: [SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in data error
Repository: spark Updated Branches: refs/heads/master 8fe5d2c39 -> 2c54aae1b [SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in data error When join key is long or int in broadcast join, Spark will use `LongToUnsafeRowMap` to store key-values of the table witch will be broadcasted. But, when `LongToUnsafeRowMap` is broadcasted to executors, and it is too big to hold in memory, it will be stored in disk. At that time, because `write` uses a variable `cursor` to determine how many bytes in `page` of `LongToUnsafeRowMap` will be write out and the `cursor` was not restore when deserializing, executor will write out nothing from page into disk. ## What changes were proposed in this pull request? Restore cursor value when deserializing. Author: liulijia Closes #21772 from liutang123/SPARK-24809. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2c54aae1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2c54aae1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2c54aae1 Branch: refs/heads/master Commit: 2c54aae1bc2fa3da26917c89e6201fb2108d9fab Parents: 8fe5d2c Author: liulijia Authored: Sun Jul 29 13:13:00 2018 -0700 Committer: Xiao Li Committed: Sun Jul 29 13:13:00 2018 -0700 -- .../sql/execution/joins/HashedRelation.scala| 2 ++ .../execution/joins/HashedRelationSuite.scala | 29 2 files changed, 31 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2c54aae1/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala index 20ce01f..86eb47a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala @@ -772,6 +772,8 @@ private[execution] final class LongToUnsafeRowMap(val mm: TaskMemoryManager, cap array = readLongArray(readBuffer, length) val pageLength = readLong().toInt page = readLongArray(readBuffer, pageLength) +// Restore cursor variable to make this map able to be serialized again on executors. 
+cursor = pageLength * 8 + Platform.LONG_ARRAY_OFFSET } override def readExternal(in: ObjectInput): Unit = { http://git-wip-us.apache.org/repos/asf/spark/blob/2c54aae1/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala index 037cc2e..d9b34dc 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala @@ -278,6 +278,35 @@ class HashedRelationSuite extends SparkFunSuite with SharedSQLContext { map.free() } + test("SPARK-24809: Serializing LongToUnsafeRowMap in executor may result in data error") { +val unsafeProj = UnsafeProjection.create(Array[DataType](LongType)) +val originalMap = new LongToUnsafeRowMap(mm, 1) + +val key1 = 1L +val value1 = 4852306286022334418L + +val key2 = 2L +val value2 = 8813607448788216010L + +originalMap.append(key1, unsafeProj(InternalRow(value1))) +originalMap.append(key2, unsafeProj(InternalRow(value2))) +originalMap.optimize() + +val ser = sparkContext.env.serializer.newInstance() +// Simulate serialize/deserialize twice on driver and executor +val firstTimeSerialized = ser.deserialize[LongToUnsafeRowMap](ser.serialize(originalMap)) +val secondTimeSerialized = + ser.deserialize[LongToUnsafeRowMap](ser.serialize(firstTimeSerialized)) + +val resultRow = new UnsafeRow(1) +assert(secondTimeSerialized.getValue(key1, resultRow).getLong(0) === value1) +assert(secondTimeSerialized.getValue(key2, resultRow).getLong(0) === value2) + +originalMap.free() +firstTimeSerialized.free() +secondTimeSerialized.free() + } + test("Spark-14521") { val ser = new KryoSerializer( (new SparkConf).set("spark.kryo.referenceTracking", "false")).newInstance() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
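The restored value follows directly from the layout described in the commit message: `page` is a long array, `cursor` is a byte offset measured from `Platform.LONG_ARRAY_OFFSET`, and `write` emits exactly the bytes the cursor accounts for, so after reading `pageLength` longs the cursor must again be `pageLength * 8 + Platform.LONG_ARRAY_OFFSET`. The self-contained sketch below reproduces the same class of bug with a deliberately simplified, hypothetical class (plain `DataOutputStream` I/O rather than Spark's `Externalizable` machinery): if the read path forgets to restore the "how much of the page is in use" counter, a second serialization round on the executor writes out an empty map.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical, simplified stand-in for a map that serializes only the used part of its page.
class TinyPageMap {
  private var page: Array[Long] = new Array[Long](16)
  private var used: Int = 0                       // plays the role of `cursor`

  def append(v: Long): Unit = { page(used) = v; used += 1 }

  def write(out: DataOutputStream): Unit = {
    out.writeInt(used)                            // only the first `used` slots are meaningful
    (0 until used).foreach(i => out.writeLong(page(i)))
  }

  def read(in: DataInputStream): Unit = {
    val n = in.readInt()
    page = Array.fill(n)(in.readLong())
    used = n   // the fix: without restoring this, a later write() would emit zero entries
  }
}

val original = new TinyPageMap
original.append(4852306286022334418L)
original.append(8813607448788216010L)

val firstRound = new ByteArrayOutputStream()
original.write(new DataOutputStream(firstRound))

// "Executor side": deserialize, then serialize again (e.g. when spilling to disk).
val deserialized = new TinyPageMap
deserialized.read(new DataInputStream(new ByteArrayInputStream(firstRound.toByteArray)))

val secondRound = new ByteArrayOutputStream()
deserialized.write(new DataOutputStream(secondRound))

// Holds only because read() restored `used`; with the bug, secondRound would contain an empty page.
assert(secondRound.toByteArray.sameElements(firstRound.toByteArray))
```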
svn commit: r28416 - in /dev/spark/2.4.0-SNAPSHOT-2018_07_29_08_02-8fe5d2c-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Sun Jul 29 15:17:11 2018 New Revision: 28416 Log: Apache Spark 2.4.0-SNAPSHOT-2018_07_29_08_02-8fe5d2c docs [This commit notification would consist of 1470 parts, which exceeds the limit of 50 parts, so it was shortened to this summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark-website git commit: (Forgot Spark version change in last commit)
Repository: spark-website Updated Branches: refs/heads/asf-site 50b4660ce -> 03f5adcb8 (Forgot Spark version change in last commit) Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/03f5adcb Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/03f5adcb Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/03f5adcb Branch: refs/heads/asf-site Commit: 03f5adcb8d9f6e8e7fb61f176bc1d28ff67da3fe Parents: 50b4660 Author: Sean Owen Authored: Sun Jul 29 09:18:14 2018 -0500 Committer: Sean Owen Committed: Sun Jul 29 09:18:14 2018 -0500 -- site/versioning-policy.html | 2 +- versioning-policy.md| 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark-website/blob/03f5adcb/site/versioning-policy.html -- diff --git a/site/versioning-policy.html b/site/versioning-policy.html index e660c3b..08d251d 100644 --- a/site/versioning-policy.html +++ b/site/versioning-policy.html @@ -252,7 +252,7 @@ generally be released about 6 months after 2.2.0. Maintenance releases happen as A minor release usually sees 1-2 maintenance releases in the 6 months following its first release. Major releases do not happen according to a fixed schedule. -Spark 2.3 Release Window +Spark 2.4 Release Window http://git-wip-us.apache.org/repos/asf/spark-website/blob/03f5adcb/versioning-policy.md -- diff --git a/versioning-policy.md b/versioning-policy.md index 22e9da9..60e0b82 100644 --- a/versioning-policy.md +++ b/versioning-policy.md @@ -57,7 +57,7 @@ generally be released about 6 months after 2.2.0. Maintenance releases happen as A minor release usually sees 1-2 maintenance releases in the 6 months following its first release. Major releases do not happen according to a fixed schedule. -Spark 2.3 Release Window +Spark 2.4 Release Window | Date | | Event | | - |-| -- | - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark-website git commit: Update Spark 2.4 release window (and fix Spark URLs in sitemap)
Repository: spark-website Updated Branches: refs/heads/asf-site d86cffd19 -> 50b4660ce Update Spark 2.4 release window (and fix Spark URLs in sitemap) Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/50b4660c Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/50b4660c Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/50b4660c Branch: refs/heads/asf-site Commit: 50b4660ce81f04fe34b995f3e9f0a74e336f482c Parents: d86cffd Author: Sean Owen Authored: Sun Jul 29 09:16:53 2018 -0500 Committer: Sean Owen Committed: Sun Jul 29 09:16:53 2018 -0500 -- site/mailing-lists.html | 2 +- site/sitemap.xml| 318 +++ site/versioning-policy.html | 6 +- versioning-policy.md| 6 +- 4 files changed, 166 insertions(+), 166 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark-website/blob/50b4660c/site/mailing-lists.html -- diff --git a/site/mailing-lists.html b/site/mailing-lists.html index d447046..f7ae56f 100644 --- a/site/mailing-lists.html +++ b/site/mailing-lists.html @@ -12,7 +12,7 @@ -http://localhost:4000/community.html"; /> +https://spark.apache.org/community.html"; /> http://git-wip-us.apache.org/repos/asf/spark-website/blob/50b4660c/site/sitemap.xml -- diff --git a/site/sitemap.xml b/site/sitemap.xml index dd69976..87ca6f6 100644 --- a/site/sitemap.xml +++ b/site/sitemap.xml @@ -139,641 +139,641 @@ - http://localhost:4000/news/spark-summit-oct-2018-agenda-posted.html + https://spark.apache.org/news/spark-summit-oct-2018-agenda-posted.html weekly - http://localhost:4000/releases/spark-release-2-2-2.html + https://spark.apache.org/releases/spark-release-2-2-2.html weekly - http://localhost:4000/news/spark-2-2-2-released.html + https://spark.apache.org/news/spark-2-2-2-released.html weekly - http://localhost:4000/releases/spark-release-2-1-3.html + https://spark.apache.org/releases/spark-release-2-1-3.html weekly - http://localhost:4000/news/spark-2-1-3-released.html + https://spark.apache.org/news/spark-2-1-3-released.html weekly - http://localhost:4000/releases/spark-release-2-3-1.html + https://spark.apache.org/releases/spark-release-2-3-1.html weekly - http://localhost:4000/news/spark-2-3-1-released.html + https://spark.apache.org/news/spark-2-3-1-released.html weekly - http://localhost:4000/news/spark-summit-june-2018-agenda-posted.html + https://spark.apache.org/news/spark-summit-june-2018-agenda-posted.html weekly - http://localhost:4000/releases/spark-release-2-3-0.html + https://spark.apache.org/releases/spark-release-2-3-0.html weekly - http://localhost:4000/news/spark-2-3-0-released.html + https://spark.apache.org/news/spark-2-3-0-released.html weekly - http://localhost:4000/releases/spark-release-2-2-1.html + https://spark.apache.org/releases/spark-release-2-2-1.html weekly - http://localhost:4000/news/spark-2-2-1-released.html + https://spark.apache.org/news/spark-2-2-1-released.html weekly - http://localhost:4000/releases/spark-release-2-1-2.html + https://spark.apache.org/releases/spark-release-2-1-2.html weekly - http://localhost:4000/news/spark-2-1-2-released.html + https://spark.apache.org/news/spark-2-1-2-released.html weekly - http://localhost:4000/news/spark-summit-eu-2017-agenda-posted.html + https://spark.apache.org/news/spark-summit-eu-2017-agenda-posted.html weekly - http://localhost:4000/releases/spark-release-2-2-0.html + https://spark.apache.org/releases/spark-release-2-2-0.html weekly - http://localhost:4000/news/spark-2-2-0-released.html + 
https://spark.apache.org/news/spark-2-2-0-released.html weekly - http://localhost:4000/releases/spark-release-2-1-1.html + https://spark.apache.org/releases/spark-release-2-1-1.html weekly - http://localhost:4000/news/spark-2-1-1-released.html + https://spark.apache.org/news/spark-2-1-1-released.html weekly - http://localhost:4000/news/spark-summit-june-2017-agenda-posted.html + https://spark.apache.org/news/spark-summit-june-2017-agenda-posted.html weekly - http://localhost:4000/news/spark-summit-east-2017-agenda-posted.html + https://spark.apache.org/news/spark-summit-east-2017-agenda-posted.html weekly - http://localhost:4000/releases/spark-release-2-1-0.html + https://spark.apache.org/releases/spark-release-2-1-0.html weekly - http://localhost:4000/news/spark-2-1-0-released.html + https://spark.apache.org/news/spark-2-1-0-released.html weekly
spark git commit: [SPARK-24956][Build][test-maven] Upgrade maven version to 3.5.4
Repository: spark Updated Branches: refs/heads/master c5b8d54c6 -> 8fe5d2c39 [SPARK-24956][Build][test-maven] Upgrade maven version to 3.5.4 ## What changes were proposed in this pull request? This PR updates maven version from 3.3.9 to 3.5.4. The current build process uses mvn 3.3.9 that was release on 2015, which looks pretty old. We met [an issue](https://issues.apache.org/jira/browse/SPARK-24895) to need the maven 3.5.2 or later. The release note of the 3.5.4 is [here](https://maven.apache.org/docs/3.5.4/release-notes.html). Note version 3.4 was skipped. >From [the release note of the >3.5.0](https://maven.apache.org/docs/3.5.0/release-notes.html), the followings >are new features: 1. ANSI color logging for improved output visibility 1. add support for module name != artifactId in every calculated URLs (project, SCM, site): special project.directory property 1. create a slf4j-simple provider extension that supports level color rendering 1. ModelResolver interface enhancement: addition of resolveModel(Dependency) supporting version ranges ## How was this patch tested? Existing tests Author: Kazuaki Ishizaki Closes #21905 from kiszk/SPARK-24956. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8fe5d2c3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8fe5d2c3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8fe5d2c3 Branch: refs/heads/master Commit: 8fe5d2c393f035b9e82ba42202421c9ba66d6c78 Parents: c5b8d54 Author: Kazuaki Ishizaki Authored: Sun Jul 29 08:31:16 2018 -0500 Committer: Sean Owen Committed: Sun Jul 29 08:31:16 2018 -0500 -- pom.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8fe5d2c3/pom.xml -- diff --git a/pom.xml b/pom.xml index f320844..9f60edc 100644 --- a/pom.xml +++ b/pom.xml @@ -114,7 +114,7 @@ 1.8 ${java.version} ${java.version} -3.3.9 +3.5.4 spark 1.7.16 1.2.17 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org