spark git commit: [SPARK-24934][SQL] Explicitly whitelist supported types in upper/lower bounds for in-memory partition pruning

2018-07-29 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 bad56bb7b -> aa51c070f


[SPARK-24934][SQL] Explicitly whitelist supported types in upper/lower bounds 
for in-memory partition pruning

## What changes were proposed in this pull request?

It looks like we intentionally set `null` for the upper/lower bounds of complex types
and do not use them. However, these bounds appear to be used during in-memory
partition pruning, which ends up producing incorrect results.

This PR proposes to explicitly whitelist the supported types.

```scala
val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol")
df.cache().filter("arrayCol > array('a', 'b')").show()
```

```scala
val df = sql("select cast('a' as binary) as a")
df.cache().filter("a == cast('a' as binary)").show()
```

**Before:**

```
+--------+
|arrayCol|
+--------+
+--------+
```

```
+---+
|  a|
+---+
+---+
```

**After:**

```
+--------+
|arrayCol|
+--------+
|  [c, d]|
+--------+
```

```
+----+
|   a|
+----+
|[61]|
+----+
```
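
To make the whitelisting idea concrete, here is a standalone sketch rather than
Spark's actual `InMemoryTableScanExec` logic; every name in it (`ColType`,
`BatchStats`, `PruningSketch`) is illustrative. Per-batch min/max statistics are
trusted only for whitelisted atomic types, so binary and complex-typed columns never
reach the bound comparison and their batches are kept rather than wrongly pruned.

```scala
// Standalone sketch (not Spark code): bounds are only used for whitelisted types.
sealed trait ColType
case object IntCol extends ColType      // atomic: usable bounds
case object BinaryCol extends ColType   // excluded, like Spark's BinaryType
case object ArrayCol extends ColType    // excluded, like complex types

final case class BatchStats(colType: ColType, lower: Option[Int], upper: Option[Int])

object PruningSketch extends App {
  // The explicit whitelist: only these types may use lower/upper bound statistics.
  private def extractable(t: ColType): Boolean = t match {
    case IntCol => true
    case _      => false
  }

  /** False only when the stats prove the batch cannot contain `value` (safe to skip). */
  def mightContain(stats: BatchStats, value: Int): Boolean =
    if (!extractable(stats.colType)) {
      true // no usable bounds: never prune the batch
    } else {
      (stats.lower, stats.upper) match {
        case (Some(lo), Some(hi)) => lo <= value && value <= hi
        case _                    => true
      }
    }

  println(mightContain(BatchStats(IntCol, Some(1), Some(5)), 7))   // false -> batch pruned
  println(mightContain(BatchStats(ArrayCol, None, None), 7))       // true  -> batch kept
}
```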

## How was this patch tested?

Unit tests were added and manually tested.

Author: hyukjinkwon 

Closes #21882 from HyukjinKwon/stats-filter.

(cherry picked from commit bfe60fcdb49aa48534060c38e36e06119900140d)
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aa51c070
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aa51c070
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aa51c070

Branch: refs/heads/branch-2.3
Commit: aa51c070f8944fd2aa94ac891b45ff51ffcc1ef2
Parents: bad56bb
Author: hyukjinkwon 
Authored: Mon Jul 30 13:20:03 2018 +0800
Committer: Wenchen Fan 
Committed: Mon Jul 30 13:20:31 2018 +0800

--
 .../columnar/InMemoryTableScanExec.scala| 42 ++--
 .../columnar/PartitionBatchPruningSuite.scala   | 30 +-
 2 files changed, 58 insertions(+), 14 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/aa51c070/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
index 08b2751..7bed7e3 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
@@ -183,6 +183,18 @@ case class InMemoryTableScanExec(
   private val stats = relation.partitionStatistics
   private def statsFor(a: Attribute) = stats.forAttribute(a)
 
+  // Currently, only use statistics from atomic types except binary type only.
+  private object ExtractableLiteral {
+def unapply(expr: Expression): Option[Literal] = expr match {
+  case lit: Literal => lit.dataType match {
+case BinaryType => None
+case _: AtomicType => Some(lit)
+case _ => None
+  }
+  case _ => None
+}
+  }
+
   // Returned filter predicate should return false iff it is impossible for the input expression
   // to evaluate to `true' based on statistics collected about this partition batch.
   @transient lazy val buildFilter: PartialFunction[Expression, Expression] = {
@@ -194,33 +206,37 @@ case class InMemoryTableScanExec(
   if buildFilter.isDefinedAt(lhs) && buildFilter.isDefinedAt(rhs) =>
   buildFilter(lhs) || buildFilter(rhs)
 
-case EqualTo(a: AttributeReference, l: Literal) =>
+case EqualTo(a: AttributeReference, ExtractableLiteral(l)) =>
   statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound
-case EqualTo(l: Literal, a: AttributeReference) =>
+case EqualTo(ExtractableLiteral(l), a: AttributeReference) =>
   statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound
 
-case EqualNullSafe(a: AttributeReference, l: Literal) =>
+case EqualNullSafe(a: AttributeReference, ExtractableLiteral(l)) =>
   statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound
-case EqualNullSafe(l: Literal, a: AttributeReference) =>
+case EqualNullSafe(ExtractableLiteral(l), a: AttributeReference) =>
   statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound
 
-case LessThan(a: AttributeReference, l: Literal) => statsFor(a).lowerBound < l
-case LessThan(l: Literal, a: AttributeReference) => l < statsFor(a).upperBound
+case LessThan(a: AttributeReference, ExtractableLiteral(l)) => statsFor(a).lowerBound < l
+case LessThan(ExtractableLiteral(l), a: AttributeReference) => l < statsFor(a).upperBound

-case LessThanOrEqual(a: AttributeReference, l: Literal) => statsFor(a).lowerBound <= l
-case LessThanOrEqual(l: Literal, a: AttributeReference) => l <= statsFor(a).upperBound

spark git commit: [SPARK-24934][SQL] Explicitly whitelist supported types in upper/lower bounds for in-memory partition pruning

2018-07-29 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 65a4bc143 -> bfe60fcdb


[SPARK-24934][SQL] Explicitly whitelist supported types in upper/lower bounds 
for in-memory partition pruning

## What changes were proposed in this pull request?

It looks like we intentionally set `null` for the upper/lower bounds of complex types
and do not use them. However, these bounds appear to be used during in-memory
partition pruning, which ends up producing incorrect results.

This PR proposes to explicitly whitelist the supported types.

```scala
val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol")
df.cache().filter("arrayCol > array('a', 'b')").show()
```

```scala
val df = sql("select cast('a' as binary) as a")
df.cache().filter("a == cast('a' as binary)").show()
```

**Before:**

```
+--------+
|arrayCol|
+--------+
+--------+
```

```
+---+
|  a|
+---+
+---+
```

**After:**

```
+--------+
|arrayCol|
+--------+
|  [c, d]|
+--------+
```

```
+----+
|   a|
+----+
|[61]|
+----+
```

## How was this patch tested?

Unit tests were added and manually tested.

Author: hyukjinkwon 

Closes #21882 from HyukjinKwon/stats-filter.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bfe60fcd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bfe60fcd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bfe60fcd

Branch: refs/heads/master
Commit: bfe60fcdb49aa48534060c38e36e06119900140d
Parents: 65a4bc1
Author: hyukjinkwon 
Authored: Mon Jul 30 13:20:03 2018 +0800
Committer: Wenchen Fan 
Committed: Mon Jul 30 13:20:03 2018 +0800

--
 .../columnar/InMemoryTableScanExec.scala| 42 ++--
 .../columnar/PartitionBatchPruningSuite.scala   | 30 +-
 2 files changed, 58 insertions(+), 14 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bfe60fcd/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
index 997cf92..6012aba 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
@@ -183,6 +183,18 @@ case class InMemoryTableScanExec(
   private val stats = relation.partitionStatistics
   private def statsFor(a: Attribute) = stats.forAttribute(a)
 
+  // Currently, only use statistics from atomic types except binary type only.
+  private object ExtractableLiteral {
+def unapply(expr: Expression): Option[Literal] = expr match {
+  case lit: Literal => lit.dataType match {
+case BinaryType => None
+case _: AtomicType => Some(lit)
+case _ => None
+  }
+  case _ => None
+}
+  }
+
   // Returned filter predicate should return false iff it is impossible for the input expression
   // to evaluate to `true' based on statistics collected about this partition batch.
   @transient lazy val buildFilter: PartialFunction[Expression, Expression] = {
@@ -194,33 +206,37 @@ case class InMemoryTableScanExec(
   if buildFilter.isDefinedAt(lhs) && buildFilter.isDefinedAt(rhs) =>
   buildFilter(lhs) || buildFilter(rhs)
 
-case EqualTo(a: AttributeReference, l: Literal) =>
+case EqualTo(a: AttributeReference, ExtractableLiteral(l)) =>
   statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound
-case EqualTo(l: Literal, a: AttributeReference) =>
+case EqualTo(ExtractableLiteral(l), a: AttributeReference) =>
   statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound
 
-case EqualNullSafe(a: AttributeReference, l: Literal) =>
+case EqualNullSafe(a: AttributeReference, ExtractableLiteral(l)) =>
   statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound
-case EqualNullSafe(l: Literal, a: AttributeReference) =>
+case EqualNullSafe(ExtractableLiteral(l), a: AttributeReference) =>
   statsFor(a).lowerBound <= l && l <= statsFor(a).upperBound
 
-case LessThan(a: AttributeReference, l: Literal) => statsFor(a).lowerBound < l
-case LessThan(l: Literal, a: AttributeReference) => l < statsFor(a).upperBound
+case LessThan(a: AttributeReference, ExtractableLiteral(l)) => statsFor(a).lowerBound < l
+case LessThan(ExtractableLiteral(l), a: AttributeReference) => l < statsFor(a).upperBound

-case LessThanOrEqual(a: AttributeReference, l: Literal) => statsFor(a).lowerBound <= l
-case LessThanOrEqual(l: Literal, a: AttributeReference) => l <= statsFor(a).upperBound
+case LessThanOrEqual(a: AttributeReference, ExtractableLiteral(l)) =>
+  

svn commit: r28425 - in /dev/spark/2.3.3-SNAPSHOT-2018_07_29_22_01-bad56bb-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-07-29 Thread pwendell
Author: pwendell
Date: Mon Jul 30 05:15:24 2018
New Revision: 28425

Log:
Apache Spark 2.3.3-SNAPSHOT-2018_07_29_22_01-bad56bb docs


[This commit notification would consist of 1443 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




spark git commit: [SPARK-21274][SQL] Implement INTERSECT ALL clause

2018-07-29 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master 6690924c4 -> 65a4bc143


[SPARK-21274][SQL] Implement INTERSECT ALL clause

## What changes were proposed in this pull request?
Implements INTERSECT ALL clause through query rewrites using existing operators 
in Spark.  Please refer to 
[Link](https://drive.google.com/open?id=1nyW0T0b_ajUduQoPgZLAsyHK8s3_dko3ulQuxaLpUXE)
 for the design.

Input Query
``` SQL
SELECT c1 FROM ut1 INTERSECT ALL SELECT c1 FROM ut2
```
Rewritten Query
```SQL
SELECT c1
FROM (
  SELECT replicate_row(min_count, c1)
  FROM (
    SELECT c1,
           IF(vcol1_cnt > vcol2_cnt, vcol2_cnt, vcol1_cnt) AS min_count
    FROM (
      SELECT c1, count(vcol1) AS vcol1_cnt, count(vcol2) AS vcol2_cnt
      FROM (
        SELECT c1, true AS vcol1, null AS vcol2 FROM ut1
        UNION ALL
        SELECT c1, null AS vcol1, true AS vcol2 FROM ut2
      ) AS union_all
      GROUP BY c1
      HAVING vcol1_cnt >= 1 AND vcol2_cnt >= 1
    )
  )
)
```
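
For illustration, a minimal sketch of the resulting semantics (a sketch assuming a
spark-shell session on a build that contains this change, with the usual
`spark.implicits._` in scope; the data and column names are made up):

```scala
import spark.implicits._

val ut1 = Seq(1, 1, 2, 3).toDF("c1")
val ut2 = Seq(1, 1, 2).toDF("c1")

// INTERSECT: only distinct common rows -> 1, 2
ut1.intersect(ut2).orderBy("c1").show()

// INTERSECT ALL: duplicates preserved up to the smaller count on each side -> 1, 1, 2
ut1.intersectAll(ut2).orderBy("c1").show()
```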

## How was this patch tested?
Added test cases in SQLQueryTestSuite, DataFrameSuite, SetOperationSuite

Author: Dilip Biswal 

Closes #21886 from dilipbiswal/dkb_intersect_all_final.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/65a4bc14
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/65a4bc14
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/65a4bc14

Branch: refs/heads/master
Commit: 65a4bc143ab5dc2ced589dc107bbafa8a7290931
Parents: 6690924
Author: Dilip Biswal 
Authored: Sun Jul 29 22:11:01 2018 -0700
Committer: Xiao Li 
Committed: Sun Jul 29 22:11:01 2018 -0700

--
 python/pyspark/sql/dataframe.py |  22 ++
 .../spark/sql/catalyst/analysis/Analyzer.scala  |   2 +-
 .../sql/catalyst/analysis/TypeCoercion.scala|   4 +-
 .../analysis/UnsupportedOperationChecker.scala  |   2 +-
 .../sql/catalyst/optimizer/Optimizer.scala  |  81 ++-
 .../spark/sql/catalyst/parser/AstBuilder.scala  |   2 +-
 .../plans/logical/basicLogicalOperators.scala   |   7 +-
 .../catalyst/optimizer/SetOperationSuite.scala  |  32 ++-
 .../sql/catalyst/parser/PlanParserSuite.scala   |   1 -
 .../scala/org/apache/spark/sql/Dataset.scala|  19 +-
 .../spark/sql/execution/SparkStrategies.scala   |   8 +-
 .../sql-tests/inputs/intersect-all.sql  | 123 ++
 .../sql-tests/results/intersect-all.sql.out | 241 +++
 .../org/apache/spark/sql/DataFrameSuite.scala   |  54 +
 .../org/apache/spark/sql/test/SQLTestData.scala |  13 +
 15 files changed, 599 insertions(+), 12 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/65a4bc14/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index b2e0a5b..07fb260 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1500,6 +1500,28 @@ class DataFrame(object):
 """
 return DataFrame(self._jdf.intersect(other._jdf), self.sql_ctx)
 
+@since(2.4)
+def intersectAll(self, other):
+""" Return a new :class:`DataFrame` containing rows in both this 
dataframe and other
+dataframe while preserving duplicates.
+
+This is equivalent to `INTERSECT ALL` in SQL.
+>>> df1 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3), ("c", 
4)], ["C1", "C2"])
+>>> df2 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3)], ["C1", 
"C2"])
+
+>>> df1.intersectAll(df2).sort("C1", "C2").show()
++---+---+
+| C1| C2|
++---+---+
+|  a|  1|
+|  a|  1|
+|  b|  3|
++---+---+
+
+Also as standard in SQL, this function resolves columns by position 
(not by name).
+"""
+return DataFrame(self._jdf.intersectAll(other._jdf), self.sql_ctx)
+
 @since(1.3)
 def subtract(self, other):
 """ Return a new :class:`DataFrame` containing rows in this frame

http://git-wip-us.apache.org/repos/asf/spark/blob/65a4bc14/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 8abb1c7..9965cd6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -914,7 +914,7 @@ 

svn commit: r28421 - in /dev/spark/2.4.0-SNAPSHOT-2018_07_29_20_02-6690924-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-07-29 Thread pwendell
Author: pwendell
Date: Mon Jul 30 03:16:25 2018
New Revision: 28421

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_07_29_20_02-6690924 docs


[This commit notification would consist of 1470 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




spark git commit: [MINOR] Avoid the 'latest' link that might vary per release in functions.scala's comment

2018-07-29 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 3210121fe -> 6690924c4


[MINOR] Avoid the 'latest' link that might vary per release in 
functions.scala's comment

## What changes were proposed in this pull request?

This PR proposes to address the comment at
https://github.com/apache/spark/pull/21318#discussion_r187843125.

This is rather a nit, but it is better to avoid having to update the link for each
release, since the "latest" link always points to the newest version (and it does not
seem worth updating the release guide for this either).

## How was this patch tested?

N/A

Author: hyukjinkwon 

Closes #21907 from HyukjinKwon/minor-fix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6690924c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6690924c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6690924c

Branch: refs/heads/master
Commit: 6690924c49a443cd629fcc1a4460cf443fb0a918
Parents: 3210121
Author: hyukjinkwon 
Authored: Mon Jul 30 10:02:29 2018 +0800
Committer: hyukjinkwon 
Committed: Mon Jul 30 10:02:29 2018 +0800

--
 sql/core/src/main/scala/org/apache/spark/sql/functions.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6690924c/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
index 2772958..a2d3792 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
@@ -44,8 +44,8 @@ import org.apache.spark.util.Utils
  *
  * Spark also includes more built-in functions that are less common and are 
not defined here.
  * You can still access them (and all the functions defined here) using the 
`functions.expr()` API
- * and calling them through a SQL expression string. You can find the entire 
list of functions for
- * the latest version of Spark at 
https://spark.apache.org/docs/latest/api/sql/index.html.
+ * and calling them through a SQL expression string. You can find the entire 
list of functions
+ * at SQL API documentation.
  *
  * As an example, `isnan` is a function that is defined here. You can use 
`isnan(col("myCol"))`
  * to invoke the `isnan` function. This way the programming language's 
compiler ensures `isnan`





spark git commit: [MINOR][BUILD] Remove -Phive-thriftserver profile within appveyor.yml

2018-07-29 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 3695ba577 -> 3210121fe


[MINOR][BUILD] Remove -Phive-thriftserver profile within appveyor.yml

## What changes were proposed in this pull request?

This PR proposes to remove the `-Phive-thriftserver` profile, which does not appear
to affect the SparkR tests in AppVeyor.

The original goal was to check whether this gives a meaningful decrease in build
time; there is some decrease, but it is not meaningful.

## How was this patch tested?

AppVeyor tests:

```
[00:40:49] Attaching package: 'SparkR'
[00:40:49]
[00:40:49] The following objects are masked from 'package:testthat':
[00:40:49]
[00:40:49] describe, not
[00:40:49]
[00:40:49] The following objects are masked from 'package:stats':
[00:40:49]
[00:40:49] cov, filter, lag, na.omit, predict, sd, var, window
[00:40:49]
[00:40:49] The following objects are masked from 'package:base':
[00:40:49]
[00:40:49] as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
[00:40:49] rank, rbind, sample, startsWith, subset, summary, transform, 
union
[00:40:49]
[00:40:49] Spark package found in SPARK_HOME: C:\projects\spark\bin\..
[00:41:43] basic tests for CRAN: .
[00:41:43]
[00:41:43] DONE 
===
[00:41:43] binary functions: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
[00:42:05] ...
[00:42:05] functions on binary files: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
[00:42:10] 
[00:42:10] broadcast variables: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
[00:42:12] ..
[00:42:12] functions in client.R: .
[00:42:30] test functions in sparkR.R: 
..
[00:42:30] include R packages: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
[00:42:31]
[00:42:31] JVM API: Spark package found in SPARK_HOME: C:\projects\spark\bin\..
[00:42:31] ..
[00:42:31] MLlib classification algorithms, except for tree-based algorithms: 
Spark package found in SPARK_HOME: C:\projects\spark\bin\..
[00:48:48] 
..
[00:48:48] MLlib clustering algorithms: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
[00:50:12] .
[00:50:12] MLlib frequent pattern mining: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
[00:50:18] .
[00:50:18] MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
[00:50:27] 
[00:50:27] MLlib regression algorithms, except for tree-based algorithms: Spark 
package found in SPARK_HOME: C:\projects\spark\bin\..
[00:56:00] 

[00:56:00] MLlib statistics algorithms: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
[00:56:04] 
[00:56:04] MLlib tree-based algorithms: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
[00:58:20] 
..
[00:58:20] parallelize() and collect(): Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
[00:58:20] .
[00:58:20] basic RDD functions: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
[01:03:35] 

[01:03:35] SerDe functionality: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
[01:03:39] ...
[01:03:39] partitionBy, groupByKey, reduceByKey etc.: Spark package found in 
SPARK_HOME: C:\projects\spark\bin\..
[01:04:20] 
[01:04:20] functions in sparkR.R: 
[01:04:20] SparkSQL functions: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
[01:04:50] 
-chgrp:
 'APPVYR-WIN\None' does not match expected pattern for group
[01:04:50] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
[01:04:50] -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
[01:04:50] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
[01:04:51] -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
[01:04:51] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
[01:06:13] 

spark git commit: [MINOR][CORE][TEST] Fix afterEach() in TaskSetManagerSuite and TaskSchedulerImplSuite

2018-07-29 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 71eb7d468 -> bad56bb7b


[MINOR][CORE][TEST] Fix afterEach() in TaskSetManagerSuite and TaskSchedulerImplSuite

## What changes were proposed in this pull request?

In the `afterEach()` method of both `TaskSetManagerSuite` and
`TaskSchedulerImplSuite`, `super.afterEach()` should be called at the end,
because it stops the SparkContext.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93706/testReport/org.apache.spark.scheduler/TaskSchedulerImplSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
The linked test failure is caused by this ordering: the newly added
`barrierCoordinator` requires `rpcEnv`, which had already been stopped before
`TaskSchedulerImpl` finished its cleanup.
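
For illustration, a minimal ScalaTest sketch of the cleanup ordering (hypothetical
suite names; it assumes a ScalaTest 3.0.x-style `FunSuite`, standing in for
`SparkFunSuite` with `LocalSparkContext`): resources that still need the shared
context must be torn down before `super.afterEach()` runs, because the parent hook
is what stops the context.

```scala
import org.scalatest.{BeforeAndAfterEach, FunSuite}

// Illustrative only: the parent's afterEach() stands in for the hook that stops
// the shared SparkContext.
abstract class ContextSuite extends FunSuite with BeforeAndAfterEach {
  protected var contextUp = false
  override def beforeEach(): Unit = { super.beforeEach(); contextUp = true }
  override def afterEach(): Unit = { contextUp = false; super.afterEach() }
}

class SchedulerLikeSuite extends ContextSuite {
  private var scheduler: String = _  // stands in for TaskSchedulerImpl

  override def afterEach(): Unit = {
    // Clean up objects that still rely on the shared context first...
    if (scheduler != null) {
      assert(contextUp, "cleanup needs the context, so it must run before super.afterEach()")
      scheduler = null
    }
    // ...and only then let the parent stop the context.
    super.afterEach()
  }

  test("scheduler setup") {
    scheduler = "scheduler"
    assert(contextUp)
  }
}
```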

## How was this patch tested?
Existing tests.

Author: Xingbo Jiang 

Closes #21908 from jiangxb1987/afterEach.

(cherry picked from commit 3695ba57731a669ed20e7f676edee602c292fbed)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bad56bb7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bad56bb7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bad56bb7

Branch: refs/heads/branch-2.3
Commit: bad56bb7b2340d338eac8cea07e9f1bb3e08b1ac
Parents: 71eb7d4
Author: Xingbo Jiang 
Authored: Mon Jul 30 09:58:28 2018 +0800
Committer: hyukjinkwon 
Committed: Mon Jul 30 09:58:54 2018 +0800

--
 .../scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala  | 2 +-
 .../scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bad56bb7/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala 
b/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
index 33f2ea1..a9e0aed 100644
--- 
a/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
+++ 
b/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
@@ -62,7 +62,6 @@ class TaskSchedulerImplSuite extends SparkFunSuite with 
LocalSparkContext with B
   }
 
   override def afterEach(): Unit = {
-super.afterEach()
 if (taskScheduler != null) {
   taskScheduler.stop()
   taskScheduler = null
@@ -71,6 +70,7 @@ class TaskSchedulerImplSuite extends SparkFunSuite with 
LocalSparkContext with B
   dagScheduler.stop()
   dagScheduler = null
 }
+super.afterEach()
   }
 
   def setupScheduler(confs: (String, String)*): TaskSchedulerImpl = {

http://git-wip-us.apache.org/repos/asf/spark/blob/bad56bb7/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala 
b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
index b4acccf..d75c245 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
@@ -178,12 +178,12 @@ class TaskSetManagerSuite extends SparkFunSuite with 
LocalSparkContext with Logg
   }
 
   override def afterEach(): Unit = {
-super.afterEach()
 if (sched != null) {
   sched.dagScheduler.stop()
   sched.stop()
   sched = null
 }
+super.afterEach()
   }
 
 





spark git commit: [MINOR][CORE][TEST] Fix afterEach() in TaskSetManagerSuite and TaskSchedulerImplSuite

2018-07-29 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 f52d0c451 -> c4b37696f


[MINOR][CORE][TEST] Fix afterEach() in TaskSetManagerSuite and TaskSchedulerImplSuite

## What changes were proposed in this pull request?

In the `afterEach()` method of both `TaskSetManagerSuite` and
`TaskSchedulerImplSuite`, `super.afterEach()` should be called at the end,
because it stops the SparkContext.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93706/testReport/org.apache.spark.scheduler/TaskSchedulerImplSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
The linked test failure is caused by this ordering: the newly added
`barrierCoordinator` requires `rpcEnv`, which had already been stopped before
`TaskSchedulerImpl` finished its cleanup.

## How was this patch tested?
Existing tests.

Author: Xingbo Jiang 

Closes #21908 from jiangxb1987/afterEach.

(cherry picked from commit 3695ba57731a669ed20e7f676edee602c292fbed)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c4b37696
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c4b37696
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c4b37696

Branch: refs/heads/branch-2.2
Commit: c4b37696f9551a6429c1d90e53f2499aada556b1
Parents: f52d0c4
Author: Xingbo Jiang 
Authored: Mon Jul 30 09:58:28 2018 +0800
Committer: hyukjinkwon 
Committed: Mon Jul 30 09:59:15 2018 +0800

--
 .../scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala  | 2 +-
 .../scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c4b37696/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala 
b/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
index 38a4f40..2da9e17 100644
--- 
a/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
+++ 
b/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
@@ -62,7 +62,6 @@ class TaskSchedulerImplSuite extends SparkFunSuite with 
LocalSparkContext with B
   }
 
   override def afterEach(): Unit = {
-super.afterEach()
 if (taskScheduler != null) {
   taskScheduler.stop()
   taskScheduler = null
@@ -71,6 +70,7 @@ class TaskSchedulerImplSuite extends SparkFunSuite with 
LocalSparkContext with B
   dagScheduler.stop()
   dagScheduler = null
 }
+super.afterEach()
   }
 
   def setupScheduler(confs: (String, String)*): TaskSchedulerImpl = {

http://git-wip-us.apache.org/repos/asf/spark/blob/c4b37696/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala 
b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
index 904f0b6..4d330b5 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
@@ -172,12 +172,12 @@ class TaskSetManagerSuite extends SparkFunSuite with 
LocalSparkContext with Logg
   }
 
   override def afterEach(): Unit = {
-super.afterEach()
 if (sched != null) {
   sched.dagScheduler.stop()
   sched.stop()
   sched = null
 }
+super.afterEach()
   }
 
 





spark git commit: [MINOR][CORE][TEST] Fix afterEach() in TaskSetManagerSuite and TaskSchedulerImplSuite

2018-07-29 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 2c54aae1b -> 3695ba577


[MINOR][CORE][TEST] Fix afterEach() in TaskSetManagerSuite and TaskSchedulerImplSuite

## What changes were proposed in this pull request?

In the `afterEach()` method of both `TaskSetManagerSuite` and
`TaskSchedulerImplSuite`, `super.afterEach()` should be called at the end,
because it stops the SparkContext.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93706/testReport/org.apache.spark.scheduler/TaskSchedulerImplSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
The linked test failure is caused by this ordering: the newly added
`barrierCoordinator` requires `rpcEnv`, which had already been stopped before
`TaskSchedulerImpl` finished its cleanup.

## How was this patch tested?
Existing tests.

Author: Xingbo Jiang 

Closes #21908 from jiangxb1987/afterEach.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3695ba57
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3695ba57
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3695ba57

Branch: refs/heads/master
Commit: 3695ba57731a669ed20e7f676edee602c292fbed
Parents: 2c54aae
Author: Xingbo Jiang 
Authored: Mon Jul 30 09:58:28 2018 +0800
Committer: hyukjinkwon 
Committed: Mon Jul 30 09:58:28 2018 +0800

--
 .../scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala  | 2 +-
 .../scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3695ba57/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala 
b/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
index 624384a..16c273b 100644
--- 
a/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
+++ 
b/core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
@@ -62,7 +62,6 @@ class TaskSchedulerImplSuite extends SparkFunSuite with 
LocalSparkContext with B
   }
 
   override def afterEach(): Unit = {
-super.afterEach()
 if (taskScheduler != null) {
   taskScheduler.stop()
   taskScheduler = null
@@ -71,6 +70,7 @@ class TaskSchedulerImplSuite extends SparkFunSuite with 
LocalSparkContext with B
   dagScheduler.stop()
   dagScheduler = null
 }
+super.afterEach()
   }
 
   def setupScheduler(confs: (String, String)*): TaskSchedulerImpl = {

http://git-wip-us.apache.org/repos/asf/spark/blob/3695ba57/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala 
b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
index cf05434..d264ada 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
@@ -178,12 +178,12 @@ class TaskSetManagerSuite extends SparkFunSuite with 
LocalSparkContext with Logg
   }
 
   override def afterEach(): Unit = {
-super.afterEach()
 if (sched != null) {
   sched.dagScheduler.stop()
   sched.stop()
   sched = null
 }
+super.afterEach()
   }
 
 





svn commit: r28419 - in /dev/spark/2.3.3-SNAPSHOT-2018_07_29_14_01-71eb7d4-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-07-29 Thread pwendell
Author: pwendell
Date: Sun Jul 29 21:15:26 2018
New Revision: 28419

Log:
Apache Spark 2.3.3-SNAPSHOT-2018_07_29_14_01-71eb7d4 docs


[This commit notification would consist of 1443 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




spark git commit: [SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in data error

2018-07-29 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 7d50fec3f -> a3eb07db3


[SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in 
data error

When the join key is a long or int in a broadcast join, Spark uses
`LongToUnsafeRowMap` to store the key-value pairs of the table that will be
broadcast. However, when `LongToUnsafeRowMap` is broadcast to executors and is too
big to hold in memory, it is stored on disk. At that point, because `write` uses
the variable `cursor` to determine how many bytes of the `page` in
`LongToUnsafeRowMap` should be written out, and `cursor` was not restored when
deserializing, the executor writes out nothing from the page to disk.

## What changes were proposed in this pull request?
Restore cursor value when deserializing.
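
A minimal sketch of the underlying pattern, using plain `java.io.Externalizable`
rather than Spark's actual `LongToUnsafeRowMap` (the `PageHolder` class and its
fields are illustrative): any field derived from the serialized payload, like the
write `cursor` here, must be recomputed in `readExternal`, otherwise serializing the
object a second time on an executor writes out an empty page.

```scala
import java.io.{Externalizable, ObjectInput, ObjectOutput}

// Illustrative only: a holder whose `cursor` marks how many bytes of `page` are in
// use. `cursor` is not written out directly, so it must be rebuilt from the payload
// length on read, mirroring the SPARK-24809 fix.
class PageHolder extends Externalizable {
  var page: Array[Byte] = Array.emptyByteArray
  var cursor: Int = 0 // derived: bytes of `page` currently in use

  override def writeExternal(out: ObjectOutput): Unit = {
    out.writeInt(cursor)       // only the used prefix is written
    out.write(page, 0, cursor)
  }

  override def readExternal(in: ObjectInput): Unit = {
    val length = in.readInt()
    page = new Array[Byte](length)
    in.readFully(page)
    // Restore the derived cursor so this object can be serialized again later
    // (e.g. when an executor spills or re-broadcasts it).
    cursor = length
  }
}
```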

Author: liulijia 

Closes #21772 from liutang123/SPARK-24809.

(cherry picked from commit 2c54aae1bc2fa3da26917c89e6201fb2108d9fab)
Signed-off-by: Xiao Li 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a3eb07db
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a3eb07db
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a3eb07db

Branch: refs/heads/branch-2.1
Commit: a3eb07db3be80be663ca66f1e9a11fcef8ab6c20
Parents: 7d50fec
Author: liulijia 
Authored: Sun Jul 29 13:13:00 2018 -0700
Committer: Xiao Li 
Committed: Sun Jul 29 13:14:57 2018 -0700

--
 .../sql/execution/joins/HashedRelation.scala|  2 ++
 .../execution/joins/HashedRelationSuite.scala   | 29 
 2 files changed, 31 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a3eb07db/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
index f7e8ea6..206afcd 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
@@ -741,6 +741,8 @@ private[execution] final class LongToUnsafeRowMap(val mm: 
TaskMemoryManager, cap
 array = readLongArray(readBuffer, length)
 val pageLength = readLong().toInt
 page = readLongArray(readBuffer, pageLength)
+// Restore cursor variable to make this map able to be serialized again on 
executors.
+cursor = pageLength * 8 + Platform.LONG_ARRAY_OFFSET
   }
 
   override def readExternal(in: ObjectInput): Unit = {

http://git-wip-us.apache.org/repos/asf/spark/blob/a3eb07db/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
index f0288c8..9c9e9dc 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
@@ -277,6 +277,35 @@ class HashedRelationSuite extends SparkFunSuite with 
SharedSQLContext {
 map.free()
   }
 
+  test("SPARK-24809: Serializing LongToUnsafeRowMap in executor may result in 
data error") {
+val unsafeProj = UnsafeProjection.create(Array[DataType](LongType))
+val originalMap = new LongToUnsafeRowMap(mm, 1)
+
+val key1 = 1L
+val value1 = 4852306286022334418L
+
+val key2 = 2L
+val value2 = 8813607448788216010L
+
+originalMap.append(key1, unsafeProj(InternalRow(value1)))
+originalMap.append(key2, unsafeProj(InternalRow(value2)))
+originalMap.optimize()
+
+val ser = sparkContext.env.serializer.newInstance()
+// Simulate serialize/deserialize twice on driver and executor
+val firstTimeSerialized = 
ser.deserialize[LongToUnsafeRowMap](ser.serialize(originalMap))
+val secondTimeSerialized =
+  ser.deserialize[LongToUnsafeRowMap](ser.serialize(firstTimeSerialized))
+
+val resultRow = new UnsafeRow(1)
+assert(secondTimeSerialized.getValue(key1, resultRow).getLong(0) === 
value1)
+assert(secondTimeSerialized.getValue(key2, resultRow).getLong(0) === 
value2)
+
+originalMap.free()
+firstTimeSerialized.free()
+secondTimeSerialized.free()
+  }
+
   test("Spark-14521") {
 val ser = new KryoSerializer(
   (new SparkConf).set("spark.kryo.referenceTracking", 
"false")).newInstance()



spark git commit: [SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in data error

2018-07-29 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 d5f340f27 -> 71eb7d468


[SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in 
data error

When the join key is a long or int in a broadcast join, Spark uses
`LongToUnsafeRowMap` to store the key-value pairs of the table that will be
broadcast. However, when `LongToUnsafeRowMap` is broadcast to executors and is too
big to hold in memory, it is stored on disk. At that point, because `write` uses
the variable `cursor` to determine how many bytes of the `page` in
`LongToUnsafeRowMap` should be written out, and `cursor` was not restored when
deserializing, the executor writes out nothing from the page to disk.

## What changes were proposed in this pull request?
Restore cursor value when deserializing.

Author: liulijia 

Closes #21772 from liutang123/SPARK-24809.

(cherry picked from commit 2c54aae1bc2fa3da26917c89e6201fb2108d9fab)
Signed-off-by: Xiao Li 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/71eb7d46
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/71eb7d46
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/71eb7d46

Branch: refs/heads/branch-2.3
Commit: 71eb7d4682a7e85e4de580ffe110da961d84817f
Parents: d5f340f
Author: liulijia 
Authored: Sun Jul 29 13:13:00 2018 -0700
Committer: Xiao Li 
Committed: Sun Jul 29 13:13:22 2018 -0700

--
 .../sql/execution/joins/HashedRelation.scala|  2 ++
 .../execution/joins/HashedRelationSuite.scala   | 29 
 2 files changed, 31 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/71eb7d46/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
index 20ce01f..86eb47a 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
@@ -772,6 +772,8 @@ private[execution] final class LongToUnsafeRowMap(val mm: 
TaskMemoryManager, cap
 array = readLongArray(readBuffer, length)
 val pageLength = readLong().toInt
 page = readLongArray(readBuffer, pageLength)
+// Restore cursor variable to make this map able to be serialized again on 
executors.
+cursor = pageLength * 8 + Platform.LONG_ARRAY_OFFSET
   }
 
   override def readExternal(in: ObjectInput): Unit = {

http://git-wip-us.apache.org/repos/asf/spark/blob/71eb7d46/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
index 037cc2e..d9b34dc 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
@@ -278,6 +278,35 @@ class HashedRelationSuite extends SparkFunSuite with 
SharedSQLContext {
 map.free()
   }
 
+  test("SPARK-24809: Serializing LongToUnsafeRowMap in executor may result in 
data error") {
+val unsafeProj = UnsafeProjection.create(Array[DataType](LongType))
+val originalMap = new LongToUnsafeRowMap(mm, 1)
+
+val key1 = 1L
+val value1 = 4852306286022334418L
+
+val key2 = 2L
+val value2 = 8813607448788216010L
+
+originalMap.append(key1, unsafeProj(InternalRow(value1)))
+originalMap.append(key2, unsafeProj(InternalRow(value2)))
+originalMap.optimize()
+
+val ser = sparkContext.env.serializer.newInstance()
+// Simulate serialize/deserialize twice on driver and executor
+val firstTimeSerialized = 
ser.deserialize[LongToUnsafeRowMap](ser.serialize(originalMap))
+val secondTimeSerialized =
+  ser.deserialize[LongToUnsafeRowMap](ser.serialize(firstTimeSerialized))
+
+val resultRow = new UnsafeRow(1)
+assert(secondTimeSerialized.getValue(key1, resultRow).getLong(0) === 
value1)
+assert(secondTimeSerialized.getValue(key2, resultRow).getLong(0) === 
value2)
+
+originalMap.free()
+firstTimeSerialized.free()
+secondTimeSerialized.free()
+  }
+
   test("Spark-14521") {
 val ser = new KryoSerializer(
   (new SparkConf).set("spark.kryo.referenceTracking", 
"false")).newInstance()



spark git commit: [SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in data error

2018-07-29 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 73764737d -> f52d0c451


[SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in 
data error

When the join key is a long or int in a broadcast join, Spark uses
`LongToUnsafeRowMap` to store the key-value pairs of the table that will be
broadcast. However, when `LongToUnsafeRowMap` is broadcast to executors and is too
big to hold in memory, it is stored on disk. At that point, because `write` uses
the variable `cursor` to determine how many bytes of the `page` in
`LongToUnsafeRowMap` should be written out, and `cursor` was not restored when
deserializing, the executor writes out nothing from the page to disk.

## What changes were proposed in this pull request?
Restore cursor value when deserializing.

Author: liulijia 

Closes #21772 from liutang123/SPARK-24809.

(cherry picked from commit 2c54aae1bc2fa3da26917c89e6201fb2108d9fab)
Signed-off-by: Xiao Li 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f52d0c45
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f52d0c45
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f52d0c45

Branch: refs/heads/branch-2.2
Commit: f52d0c4515f3f0ceaea03c661fb7739c70c25236
Parents: 7376473
Author: liulijia 
Authored: Sun Jul 29 13:13:00 2018 -0700
Committer: Xiao Li 
Committed: Sun Jul 29 13:13:57 2018 -0700

--
 .../sql/execution/joins/HashedRelation.scala|  2 ++
 .../execution/joins/HashedRelationSuite.scala   | 29 
 2 files changed, 31 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f52d0c45/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
index 07ee3d0..78190bf 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
@@ -741,6 +741,8 @@ private[execution] final class LongToUnsafeRowMap(val mm: 
TaskMemoryManager, cap
 array = readLongArray(readBuffer, length)
 val pageLength = readLong().toInt
 page = readLongArray(readBuffer, pageLength)
+// Restore cursor variable to make this map able to be serialized again on 
executors.
+cursor = pageLength * 8 + Platform.LONG_ARRAY_OFFSET
   }
 
   override def readExternal(in: ObjectInput): Unit = {

http://git-wip-us.apache.org/repos/asf/spark/blob/f52d0c45/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
index f0288c8..9c9e9dc 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
@@ -277,6 +277,35 @@ class HashedRelationSuite extends SparkFunSuite with 
SharedSQLContext {
 map.free()
   }
 
+  test("SPARK-24809: Serializing LongToUnsafeRowMap in executor may result in 
data error") {
+val unsafeProj = UnsafeProjection.create(Array[DataType](LongType))
+val originalMap = new LongToUnsafeRowMap(mm, 1)
+
+val key1 = 1L
+val value1 = 4852306286022334418L
+
+val key2 = 2L
+val value2 = 8813607448788216010L
+
+originalMap.append(key1, unsafeProj(InternalRow(value1)))
+originalMap.append(key2, unsafeProj(InternalRow(value2)))
+originalMap.optimize()
+
+val ser = sparkContext.env.serializer.newInstance()
+// Simulate serialize/deserialize twice on driver and executor
+val firstTimeSerialized = 
ser.deserialize[LongToUnsafeRowMap](ser.serialize(originalMap))
+val secondTimeSerialized =
+  ser.deserialize[LongToUnsafeRowMap](ser.serialize(firstTimeSerialized))
+
+val resultRow = new UnsafeRow(1)
+assert(secondTimeSerialized.getValue(key1, resultRow).getLong(0) === 
value1)
+assert(secondTimeSerialized.getValue(key2, resultRow).getLong(0) === 
value2)
+
+originalMap.free()
+firstTimeSerialized.free()
+secondTimeSerialized.free()
+  }
+
   test("Spark-14521") {
 val ser = new KryoSerializer(
   (new SparkConf).set("spark.kryo.referenceTracking", 
"false")).newInstance()



spark git commit: [SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in data error

2018-07-29 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master 8fe5d2c39 -> 2c54aae1b


[SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in 
data error

When the join key is a long or int in a broadcast join, Spark uses
`LongToUnsafeRowMap` to store the key-value pairs of the table that will be
broadcast. However, when `LongToUnsafeRowMap` is broadcast to executors and is too
big to hold in memory, it is stored on disk. At that point, because `write` uses
the variable `cursor` to determine how many bytes of the `page` in
`LongToUnsafeRowMap` should be written out, and `cursor` was not restored when
deserializing, the executor writes out nothing from the page to disk.

## What changes were proposed in this pull request?
Restore cursor value when deserializing.

Author: liulijia 

Closes #21772 from liutang123/SPARK-24809.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2c54aae1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2c54aae1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2c54aae1

Branch: refs/heads/master
Commit: 2c54aae1bc2fa3da26917c89e6201fb2108d9fab
Parents: 8fe5d2c
Author: liulijia 
Authored: Sun Jul 29 13:13:00 2018 -0700
Committer: Xiao Li 
Committed: Sun Jul 29 13:13:00 2018 -0700

--
 .../sql/execution/joins/HashedRelation.scala|  2 ++
 .../execution/joins/HashedRelationSuite.scala   | 29 
 2 files changed, 31 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2c54aae1/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
index 20ce01f..86eb47a 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
@@ -772,6 +772,8 @@ private[execution] final class LongToUnsafeRowMap(val mm: 
TaskMemoryManager, cap
 array = readLongArray(readBuffer, length)
 val pageLength = readLong().toInt
 page = readLongArray(readBuffer, pageLength)
+// Restore cursor variable to make this map able to be serialized again on 
executors.
+cursor = pageLength * 8 + Platform.LONG_ARRAY_OFFSET
   }
 
   override def readExternal(in: ObjectInput): Unit = {

http://git-wip-us.apache.org/repos/asf/spark/blob/2c54aae1/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
index 037cc2e..d9b34dc 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
@@ -278,6 +278,35 @@ class HashedRelationSuite extends SparkFunSuite with 
SharedSQLContext {
 map.free()
   }
 
+  test("SPARK-24809: Serializing LongToUnsafeRowMap in executor may result in 
data error") {
+val unsafeProj = UnsafeProjection.create(Array[DataType](LongType))
+val originalMap = new LongToUnsafeRowMap(mm, 1)
+
+val key1 = 1L
+val value1 = 4852306286022334418L
+
+val key2 = 2L
+val value2 = 8813607448788216010L
+
+originalMap.append(key1, unsafeProj(InternalRow(value1)))
+originalMap.append(key2, unsafeProj(InternalRow(value2)))
+originalMap.optimize()
+
+val ser = sparkContext.env.serializer.newInstance()
+// Simulate serialize/deserialize twice on driver and executor
+val firstTimeSerialized = 
ser.deserialize[LongToUnsafeRowMap](ser.serialize(originalMap))
+val secondTimeSerialized =
+  ser.deserialize[LongToUnsafeRowMap](ser.serialize(firstTimeSerialized))
+
+val resultRow = new UnsafeRow(1)
+assert(secondTimeSerialized.getValue(key1, resultRow).getLong(0) === 
value1)
+assert(secondTimeSerialized.getValue(key2, resultRow).getLong(0) === 
value2)
+
+originalMap.free()
+firstTimeSerialized.free()
+secondTimeSerialized.free()
+  }
+
   test("Spark-14521") {
 val ser = new KryoSerializer(
   (new SparkConf).set("spark.kryo.referenceTracking", 
"false")).newInstance()





svn commit: r28416 - in /dev/spark/2.4.0-SNAPSHOT-2018_07_29_08_02-8fe5d2c-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-07-29 Thread pwendell
Author: pwendell
Date: Sun Jul 29 15:17:11 2018
New Revision: 28416

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_07_29_08_02-8fe5d2c docs


[This commit notification would consist of 1470 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




spark-website git commit: (Forgot Spark version change in last commit)

2018-07-29 Thread srowen
Repository: spark-website
Updated Branches:
  refs/heads/asf-site 50b4660ce -> 03f5adcb8


(Forgot Spark version change in last commit)


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/03f5adcb
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/03f5adcb
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/03f5adcb

Branch: refs/heads/asf-site
Commit: 03f5adcb8d9f6e8e7fb61f176bc1d28ff67da3fe
Parents: 50b4660
Author: Sean Owen 
Authored: Sun Jul 29 09:18:14 2018 -0500
Committer: Sean Owen 
Committed: Sun Jul 29 09:18:14 2018 -0500

--
 site/versioning-policy.html | 2 +-
 versioning-policy.md| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/03f5adcb/site/versioning-policy.html
--
diff --git a/site/versioning-policy.html b/site/versioning-policy.html
index e660c3b..08d251d 100644
--- a/site/versioning-policy.html
+++ b/site/versioning-policy.html
@@ -252,7 +252,7 @@ generally be released about 6 months after 2.2.0. 
Maintenance releases happen as
 A minor release usually sees 1-2 maintenance releases in the 6 months 
following its first
 release. Major releases do not happen according to a fixed schedule.
 
-Spark 2.3 Release Window
+Spark 2.4 Release Window
 
 
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/03f5adcb/versioning-policy.md
--
diff --git a/versioning-policy.md b/versioning-policy.md
index 22e9da9..60e0b82 100644
--- a/versioning-policy.md
+++ b/versioning-policy.md
@@ -57,7 +57,7 @@ generally be released about 6 months after 2.2.0. Maintenance 
releases happen as
 A minor release usually sees 1-2 maintenance releases in the 6 months 
following its first
 release. Major releases do not happen according to a fixed schedule.
 
-Spark 2.3 Release Window
+Spark 2.4 Release Window
 
 | Date  | | Event  |
 | - |-| -- |





spark-website git commit: Update Spark 2.4 release window (and fix Spark URLs in sitemap)

2018-07-29 Thread srowen
Repository: spark-website
Updated Branches:
  refs/heads/asf-site d86cffd19 -> 50b4660ce


Update Spark 2.4 release window (and fix Spark URLs in sitemap)


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/50b4660c
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/50b4660c
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/50b4660c

Branch: refs/heads/asf-site
Commit: 50b4660ce81f04fe34b995f3e9f0a74e336f482c
Parents: d86cffd
Author: Sean Owen 
Authored: Sun Jul 29 09:16:53 2018 -0500
Committer: Sean Owen 
Committed: Sun Jul 29 09:16:53 2018 -0500

--
 site/mailing-lists.html |   2 +-
 site/sitemap.xml| 318 +++
 site/versioning-policy.html |   6 +-
 versioning-policy.md|   6 +-
 4 files changed, 166 insertions(+), 166 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/50b4660c/site/mailing-lists.html
--
diff --git a/site/mailing-lists.html b/site/mailing-lists.html
index d447046..f7ae56f 100644
--- a/site/mailing-lists.html
+++ b/site/mailing-lists.html
@@ -12,7 +12,7 @@
 
   
 
-http://localhost:4000/community.html; />
+https://spark.apache.org/community.html; />
   
 
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/50b4660c/site/sitemap.xml
--
diff --git a/site/sitemap.xml b/site/sitemap.xml
index dd69976..87ca6f6 100644
--- a/site/sitemap.xml
+++ b/site/sitemap.xml
@@ -139,641 +139,641 @@
 
 
 
-  
http://localhost:4000/news/spark-summit-oct-2018-agenda-posted.html
+  
https://spark.apache.org/news/spark-summit-oct-2018-agenda-posted.html
   weekly
 
 
-  http://localhost:4000/releases/spark-release-2-2-2.html
+  https://spark.apache.org/releases/spark-release-2-2-2.html
   weekly
 
 
-  http://localhost:4000/news/spark-2-2-2-released.html
+  https://spark.apache.org/news/spark-2-2-2-released.html
   weekly
 
 
-  http://localhost:4000/releases/spark-release-2-1-3.html
+  https://spark.apache.org/releases/spark-release-2-1-3.html
   weekly
 
 
-  http://localhost:4000/news/spark-2-1-3-released.html
+  https://spark.apache.org/news/spark-2-1-3-released.html
   weekly
 
 
-  http://localhost:4000/releases/spark-release-2-3-1.html
+  https://spark.apache.org/releases/spark-release-2-3-1.html
   weekly
 
 
-  http://localhost:4000/news/spark-2-3-1-released.html
+  https://spark.apache.org/news/spark-2-3-1-released.html
   weekly
 
 
-  
http://localhost:4000/news/spark-summit-june-2018-agenda-posted.html
+  
https://spark.apache.org/news/spark-summit-june-2018-agenda-posted.html
   weekly
 
 
-  http://localhost:4000/releases/spark-release-2-3-0.html
+  https://spark.apache.org/releases/spark-release-2-3-0.html
   weekly
 
 
-  http://localhost:4000/news/spark-2-3-0-released.html
+  https://spark.apache.org/news/spark-2-3-0-released.html
   weekly
 
 
-  http://localhost:4000/releases/spark-release-2-2-1.html
+  https://spark.apache.org/releases/spark-release-2-2-1.html
   weekly
 
 
-  http://localhost:4000/news/spark-2-2-1-released.html
+  https://spark.apache.org/news/spark-2-2-1-released.html
   weekly
 
 
-  http://localhost:4000/releases/spark-release-2-1-2.html
+  https://spark.apache.org/releases/spark-release-2-1-2.html
   weekly
 
 
-  http://localhost:4000/news/spark-2-1-2-released.html
+  https://spark.apache.org/news/spark-2-1-2-released.html
   weekly
 
 
-  http://localhost:4000/news/spark-summit-eu-2017-agenda-posted.html
+  
https://spark.apache.org/news/spark-summit-eu-2017-agenda-posted.html
   weekly
 
 
-  http://localhost:4000/releases/spark-release-2-2-0.html
+  https://spark.apache.org/releases/spark-release-2-2-0.html
   weekly
 
 
-  http://localhost:4000/news/spark-2-2-0-released.html
+  https://spark.apache.org/news/spark-2-2-0-released.html
   weekly
 
 
-  http://localhost:4000/releases/spark-release-2-1-1.html
+  https://spark.apache.org/releases/spark-release-2-1-1.html
   weekly
 
 
-  http://localhost:4000/news/spark-2-1-1-released.html
+  https://spark.apache.org/news/spark-2-1-1-released.html
   weekly
 
 
-  
http://localhost:4000/news/spark-summit-june-2017-agenda-posted.html
+  
https://spark.apache.org/news/spark-summit-june-2017-agenda-posted.html
   weekly
 
 
-  
http://localhost:4000/news/spark-summit-east-2017-agenda-posted.html
+  
https://spark.apache.org/news/spark-summit-east-2017-agenda-posted.html
   weekly
 
 
-  http://localhost:4000/releases/spark-release-2-1-0.html
+  https://spark.apache.org/releases/spark-release-2-1-0.html
   weekly
 
 
-  http://localhost:4000/news/spark-2-1-0-released.html
+  https://spark.apache.org/news/spark-2-1-0-released.html
   weekly
 
 

spark git commit: [SPARK-24956][Build][test-maven] Upgrade maven version to 3.5.4

2018-07-29 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master c5b8d54c6 -> 8fe5d2c39


[SPARK-24956][Build][test-maven] Upgrade maven version to 3.5.4

## What changes were proposed in this pull request?

This PR updates the maven version from 3.3.9 to 3.5.4. The current build process
uses mvn 3.3.9, which was released in 2015 and looks pretty old.
We hit [an issue](https://issues.apache.org/jira/browse/SPARK-24895) that requires
maven 3.5.2 or later.

The release notes for 3.5.4 are
[here](https://maven.apache.org/docs/3.5.4/release-notes.html). Note that version
3.4 was skipped.

From [the release notes of
3.5.0](https://maven.apache.org/docs/3.5.0/release-notes.html), the following are
new features:
1. ANSI color logging for improved output visibility
1. add support for module name != artifactId in every calculated URLs (project, 
SCM, site): special project.directory property
1. create a slf4j-simple provider extension that supports level color rendering
1. ModelResolver interface enhancement: addition of resolveModel(Dependency) 
supporting version ranges

## How was this patch tested?

Existing tests

Author: Kazuaki Ishizaki 

Closes #21905 from kiszk/SPARK-24956.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8fe5d2c3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8fe5d2c3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8fe5d2c3

Branch: refs/heads/master
Commit: 8fe5d2c393f035b9e82ba42202421c9ba66d6c78
Parents: c5b8d54
Author: Kazuaki Ishizaki 
Authored: Sun Jul 29 08:31:16 2018 -0500
Committer: Sean Owen 
Committed: Sun Jul 29 08:31:16 2018 -0500

--
 pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8fe5d2c3/pom.xml
--
diff --git a/pom.xml b/pom.xml
index f320844..9f60edc 100644
--- a/pom.xml
+++ b/pom.xml
@@ -114,7 +114,7 @@
 1.8
 ${java.version}
 ${java.version}
-3.3.9
+3.5.4
 spark
 1.7.16
 1.2.17

