spark git commit: [SPARK-22601][SQL] Data load is getting displayed successful on providing non existing nonlocal file path

2017-11-30 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 af8a692d6 -> ba00bd961


[SPARK-22601][SQL] Data load is getting displayed successful on providing non 
existing nonlocal file path

## What changes were proposed in this pull request?
When a user tries to load data from a non-existing HDFS file path, the system does not
validate it and the LOAD command completes successfully, which is misleading to the user.
A validation already exists for non-existing local file paths; this PR adds the same
validation for non-existing HDFS file paths.

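A minimal reproduction sketch of the behavior this patch changes, assuming a Hive-enabled
SparkSession named `spark` (the table name and path come from the test added in this diff):

```scala
spark.sql("CREATE TABLE tbl(i INT, j STRING)")

// Before this patch the command below "succeeded" even though the HDFS path is missing,
// which was misleading; after this patch it fails fast, matching the local-path behavior:
spark.sql("LOAD DATA INPATH '/doesnotexist.csv' INTO TABLE tbl")
// org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /doesnotexist.csv
```
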
## How was this patch tested?
A unit test has been added to verify the issue; snapshots were also taken after verifying
the fix on a Spark YARN cluster.

Author: sujith71955 

Closes #19823 from sujith71955/master_LoadComand_Issue.

(cherry picked from commit 16adaf634bcca3074b448d95e72177eefdf50069)
Signed-off-by: gatorsmile 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ba00bd96
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ba00bd96
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ba00bd96

Branch: refs/heads/branch-2.2
Commit: ba00bd9615cc37a903f4333dad57e0eeafbdfd0c
Parents: af8a692
Author: sujith71955 
Authored: Thu Nov 30 20:45:30 2017 -0800
Committer: gatorsmile 
Committed: Thu Nov 30 20:46:46 2017 -0800

--
 .../org/apache/spark/sql/execution/command/tables.scala | 9 -
 .../org/apache/spark/sql/hive/execution/HiveDDLSuite.scala  | 9 +
 2 files changed, 17 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ba00bd96/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
index 8b61240..126c1cb 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
@@ -333,7 +333,7 @@ case class LoadDataCommand(
 uri
   } else {
 val uri = new URI(path)
-if (uri.getScheme() != null && uri.getAuthority() != null) {
+val hdfsUri = if (uri.getScheme() != null && uri.getAuthority() != null) {
   uri
 } else {
   // Follow Hive's behavior:
@@ -373,6 +373,13 @@ case class LoadDataCommand(
   }
  new URI(scheme, authority, absolutePath, uri.getQuery(), uri.getFragment())
 }
+val hadoopConf = sparkSession.sessionState.newHadoopConf()
+val srcPath = new Path(hdfsUri)
+val fs = srcPath.getFileSystem(hadoopConf)
+if (!fs.exists(srcPath)) {
+  throw new AnalysisException(s"LOAD DATA input path does not exist: 
$path")
+}
+hdfsUri
   }
 
 if (partition.nonEmpty) {

http://git-wip-us.apache.org/repos/asf/spark/blob/ba00bd96/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
index c1c8281..f4c2625 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
@@ -1983,4 +1983,13 @@ class HiveDDLSuite
   }
 }
   }
+
+  test("load command for non local invalid path validation") {
+withTable("tbl") {
+  sql("CREATE TABLE tbl(i INT, j STRING)")
+  val e = intercept[AnalysisException](
+sql("load data inpath '/doesnotexist.csv' into table tbl"))
+  assert(e.message.contains("LOAD DATA input path does not exist"))
+}
+  }
 }





spark git commit: [SPARK-22601][SQL] Data load is getting displayed successful on providing non existing nonlocal file path

2017-11-30 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master dc365422b -> 16adaf634


[SPARK-22601][SQL] Data load is getting displayed successful on providing non 
existing nonlocal file path

## What changes were proposed in this pull request?
When a user tries to load data from a non-existing HDFS file path, the system does not
validate it and the LOAD command completes successfully, which is misleading to the user.
A validation already exists for non-existing local file paths; this PR adds the same
validation for non-existing HDFS file paths.

## How was this patch tested?
A unit test has been added to verify the issue; snapshots were also taken after verifying
the fix on a Spark YARN cluster.

Author: sujith71955 

Closes #19823 from sujith71955/master_LoadComand_Issue.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/16adaf63
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/16adaf63
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/16adaf63

Branch: refs/heads/master
Commit: 16adaf634bcca3074b448d95e72177eefdf50069
Parents: dc36542
Author: sujith71955 
Authored: Thu Nov 30 20:45:30 2017 -0800
Committer: gatorsmile 
Committed: Thu Nov 30 20:45:30 2017 -0800

--
 .../org/apache/spark/sql/execution/command/tables.scala | 9 -
 .../org/apache/spark/sql/hive/execution/HiveDDLSuite.scala  | 9 +
 2 files changed, 17 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/16adaf63/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
index c9f6e57..c42e6c3 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
@@ -340,7 +340,7 @@ case class LoadDataCommand(
 uri
   } else {
 val uri = new URI(path)
-if (uri.getScheme() != null && uri.getAuthority() != null) {
+val hdfsUri = if (uri.getScheme() != null && uri.getAuthority() != null) {
   uri
 } else {
   // Follow Hive's behavior:
@@ -380,6 +380,13 @@ case class LoadDataCommand(
   }
  new URI(scheme, authority, absolutePath, uri.getQuery(), uri.getFragment())
 }
+val hadoopConf = sparkSession.sessionState.newHadoopConf()
+val srcPath = new Path(hdfsUri)
+val fs = srcPath.getFileSystem(hadoopConf)
+if (!fs.exists(srcPath)) {
+  throw new AnalysisException(s"LOAD DATA input path does not exist: 
$path")
+}
+hdfsUri
   }
 
 if (partition.nonEmpty) {

http://git-wip-us.apache.org/repos/asf/spark/blob/16adaf63/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
index 9063ef0..6c11905 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
@@ -2141,4 +2141,13 @@ class HiveDDLSuite
   }
 }
   }
+
+  test("load command for non local invalid path validation") {
+withTable("tbl") {
+  sql("CREATE TABLE tbl(i INT, j STRING)")
+  val e = intercept[AnalysisException](
+sql("load data inpath '/doesnotexist.csv' into table tbl"))
+  assert(e.message.contains("LOAD DATA input path does not exist"))
+}
+  }
 }





spark git commit: [SPARK-22653] executorAddress registered in CoarseGrainedSchedulerBac…

2017-11-30 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 0121ebc83 -> af8a692d6


[SPARK-22653] executorAddress registered in CoarseGrainedSchedulerBac…

https://issues.apache.org/jira/browse/SPARK-22653
executorRef.address can be null; pass executorAddress instead, which already accounts for
the null case a few lines above the fix.

Manually tested this patch. You can reproduce the issue by running a simple spark-shell in
YARN client mode with dynamic allocation and requesting some executors up front. Let those
executors idle-timeout, then take a heap dump. Without this fix, you will see that
addressToExecutorId still contains the stale executor ids; with the fix, addressToExecutorId
is properly cleaned up.
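For context, a small self-contained model of the bookkeeping problem (names mirror the
scheduler backend, but all types below are stand-ins rather than Spark code):

```scala
import scala.collection.mutable

final case class RpcAddress(host: String, port: Int)
final case class ExecutorData(address: RpcAddress /* other fields elided */)

val addressToExecutorId = mutable.HashMap.empty[RpcAddress, String]
val executorDataMap = mutable.HashMap.empty[String, ExecutorData]

def register(executorId: String, refAddress: Option[RpcAddress], senderAddress: RpcAddress): Unit = {
  // The backend already resolves a non-null address a few lines above the fixed line.
  val executorAddress = refAddress.getOrElse(senderAddress)
  addressToExecutorId(executorAddress) = executorId
  // The fix: store the same resolved address (not the possibly-null executorRef.address)
  // in ExecutorData, so removal below can find and drop the matching map entry.
  executorDataMap(executorId) = ExecutorData(executorAddress)
}

def removeExecutor(executorId: String): Unit =
  executorDataMap.remove(executorId).foreach(d => addressToExecutorId.remove(d.address))
```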

Author: Thomas Graves 

Closes #19850 from tgravescs/SPARK-22653.

(cherry picked from commit dc365422bb337d19ef39739c7c3cf9e53ec85d09)
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/af8a692d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/af8a692d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/af8a692d

Branch: refs/heads/branch-2.2
Commit: af8a692d6568fd612fa418763d4e65acbbee0fd4
Parents: 0121ebc
Author: Thomas Graves 
Authored: Fri Dec 1 10:53:16 2017 +0800
Committer: Wenchen Fan 
Committed: Fri Dec 1 10:57:25 2017 +0800

--
 .../spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/af8a692d/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
 
b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
index dc82bb7..ab6e426 100644
--- 
a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
+++ 
b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
@@ -178,7 +178,7 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
   addressToExecutorId(executorAddress) = executorId
   totalCoreCount.addAndGet(cores)
   totalRegisteredExecutors.addAndGet(1)
-  val data = new ExecutorData(executorRef, executorRef.address, hostname,
+  val data = new ExecutorData(executorRef, executorAddress, hostname,
 cores, cores, logUrls)
   // This must be synchronized because variables mutated
   // in this block are read when requesting executors





spark git commit: [SPARK-22653] executorAddress registered in CoarseGrainedSchedulerBac…

2017-11-30 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 7da1f5708 -> dc365422b


[SPARK-22653] executorAddress registered in CoarseGrainedSchedulerBac…

https://issues.apache.org/jira/browse/SPARK-22653
executorRef.address can be null; pass executorAddress instead, which already accounts for
the null case a few lines above the fix.

Manually tested this patch. You can reproduce the issue by running a simple spark-shell in
YARN client mode with dynamic allocation and requesting some executors up front. Let those
executors idle-timeout, then take a heap dump. Without this fix, you will see that
addressToExecutorId still contains the stale executor ids; with the fix, addressToExecutorId
is properly cleaned up.

Author: Thomas Graves 

Closes #19850 from tgravescs/SPARK-22653.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dc365422
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dc365422
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dc365422

Branch: refs/heads/master
Commit: dc365422bb337d19ef39739c7c3cf9e53ec85d09
Parents: 7da1f57
Author: Thomas Graves 
Authored: Fri Dec 1 10:53:16 2017 +0800
Committer: Wenchen Fan 
Committed: Fri Dec 1 10:53:16 2017 +0800

--
 .../spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/dc365422/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
 
b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
index 22d9c4c..7bfb4d5 100644
--- 
a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
+++ 
b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
@@ -182,7 +182,7 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
   addressToExecutorId(executorAddress) = executorId
   totalCoreCount.addAndGet(cores)
   totalRegisteredExecutors.addAndGet(1)
-  val data = new ExecutorData(executorRef, executorRef.address, hostname,
+  val data = new ExecutorData(executorRef, executorAddress, hostname,
 cores, cores, logUrls)
   // This must be synchronized because variables mutated
   // in this block are read when requesting executors





spark git commit: [SPARK-22373] Bump Janino dependency version to fix thread safety issue…

2017-11-30 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 7e5f669eb -> 7da1f5708


[SPARK-22373] Bump Janino dependency version to fix thread safety issue…

… with Janino when compiling generated code.

## What changes were proposed in this pull request?

Bump up the Janino dependency version to fix a thread-safety issue when compiling
generated code.

## How was this patch tested?

Check https://issues.apache.org/jira/browse/SPARK-22373 for details.
Converted part of the code in CodeGenerator into a standalone application, so 
the issue can be consistently reproduced locally.
Verified that changing Janino dependency version resolved this issue.
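For downstream projects on an older Spark that hit this issue, a hypothetical sbt override
forcing the fixed Janino release could look like the following (the Maven coordinates match
the jars bumped in this diff):

```scala
// Hypothetical build.sbt fragment; 3.0.7 is the release this PR moves to.
dependencyOverrides += "org.codehaus.janino" % "janino" % "3.0.7"
dependencyOverrides += "org.codehaus.janino" % "commons-compiler" % "3.0.7"
```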

Author: Min Shen 

Closes #19839 from Victsm/SPARK-22373.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7da1f570
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7da1f570
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7da1f570

Branch: refs/heads/master
Commit: 7da1f5708cc96c18ddb3acd09542621275e71d83
Parents: 7e5f669
Author: Min Shen 
Authored: Thu Nov 30 19:24:44 2017 -0600
Committer: Sean Owen 
Committed: Thu Nov 30 19:24:44 2017 -0600

--
 dev/deps/spark-deps-hadoop-2.6 | 4 ++--
 dev/deps/spark-deps-hadoop-2.7 | 4 ++--
 pom.xml| 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7da1f570/dev/deps/spark-deps-hadoop-2.6
--
diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6
index 50ac6d1..8f50821 100644
--- a/dev/deps/spark-deps-hadoop-2.6
+++ b/dev/deps/spark-deps-hadoop-2.6
@@ -35,7 +35,7 @@ commons-beanutils-core-1.8.0.jar
 commons-cli-1.2.jar
 commons-codec-1.10.jar
 commons-collections-3.2.2.jar
-commons-compiler-3.0.0.jar
+commons-compiler-3.0.7.jar
 commons-compress-1.4.1.jar
 commons-configuration-1.6.jar
 commons-crypto-1.0.0.jar
@@ -96,7 +96,7 @@ jackson-mapper-asl-1.9.13.jar
 jackson-module-paranamer-2.7.9.jar
 jackson-module-scala_2.11-2.6.7.1.jar
 jackson-xc-1.9.13.jar
-janino-3.0.0.jar
+janino-3.0.7.jar
 java-xmlbuilder-1.0.jar
 javassist-3.18.1-GA.jar
 javax.annotation-api-1.2.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/7da1f570/dev/deps/spark-deps-hadoop-2.7
--
diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7
index 1b1e316..68e937f 100644
--- a/dev/deps/spark-deps-hadoop-2.7
+++ b/dev/deps/spark-deps-hadoop-2.7
@@ -35,7 +35,7 @@ commons-beanutils-core-1.8.0.jar
 commons-cli-1.2.jar
 commons-codec-1.10.jar
 commons-collections-3.2.2.jar
-commons-compiler-3.0.0.jar
+commons-compiler-3.0.7.jar
 commons-compress-1.4.1.jar
 commons-configuration-1.6.jar
 commons-crypto-1.0.0.jar
@@ -96,7 +96,7 @@ jackson-mapper-asl-1.9.13.jar
 jackson-module-paranamer-2.7.9.jar
 jackson-module-scala_2.11-2.6.7.1.jar
 jackson-xc-1.9.13.jar
-janino-3.0.0.jar
+janino-3.0.7.jar
 java-xmlbuilder-1.0.jar
 javassist-3.18.1-GA.jar
 javax.annotation-api-1.2.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/7da1f570/pom.xml
--
diff --git a/pom.xml b/pom.xml
index 7bc66e7..731ee86 100644
--- a/pom.xml
+++ b/pom.xml
@@ -170,7 +170,7 @@
 
 3.5
 3.2.10
-3.0.0
+3.0.7
 2.22.2
 2.9.3
 3.5.2





spark git commit: [SPARK-22373] Bump Janino dependency version to fix thread safety issue…

2017-11-30 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 a02a8bd23 -> 4e3680f22


[SPARK-22373] Bump Janino dependency version to fix thread safety issue…

… with Janino when compiling generated code.

## What changes were proposed in this pull request?

Bump up the Janino dependency version to fix a thread-safety issue when compiling
generated code.

## How was this patch tested?

Check https://issues.apache.org/jira/browse/SPARK-22373 for details.
Converted part of the code in CodeGenerator into a standalone application, so 
the issue can be consistently reproduced locally.
Verified that changing Janino dependency version resolved this issue.

Author: Min Shen 

Closes #19839 from Victsm/SPARK-22373.

(cherry picked from commit 7da1f5708cc96c18ddb3acd09542621275e71d83)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4e3680f2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4e3680f2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4e3680f2

Branch: refs/heads/branch-2.1
Commit: 4e3680f221868d7fcca23e1aafea20b80465b79b
Parents: a02a8bd
Author: Min Shen 
Authored: Thu Nov 30 19:24:44 2017 -0600
Committer: Sean Owen 
Committed: Thu Nov 30 19:25:03 2017 -0600

--
 dev/deps/spark-deps-hadoop-2.6 | 4 ++--
 dev/deps/spark-deps-hadoop-2.7 | 4 ++--
 pom.xml| 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4e3680f2/dev/deps/spark-deps-hadoop-2.6
--
diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6
index 80f8b5b..f67b534 100644
--- a/dev/deps/spark-deps-hadoop-2.6
+++ b/dev/deps/spark-deps-hadoop-2.6
@@ -31,7 +31,7 @@ commons-beanutils-core-1.8.0.jar
 commons-cli-1.2.jar
 commons-codec-1.10.jar
 commons-collections-3.2.2.jar
-commons-compiler-3.0.0.jar
+commons-compiler-3.0.7.jar
 commons-compress-1.4.1.jar
 commons-configuration-1.6.jar
 commons-crypto-1.0.0.jar
@@ -90,7 +90,7 @@ jackson-mapper-asl-1.9.13.jar
 jackson-module-paranamer-2.6.5.jar
 jackson-module-scala_2.11-2.6.5.jar
 jackson-xc-1.9.13.jar
-janino-3.0.0.jar
+janino-3.0.7.jar
 java-xmlbuilder-1.0.jar
 javassist-3.18.1-GA.jar
 javax.annotation-api-1.2.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/4e3680f2/dev/deps/spark-deps-hadoop-2.7
--
diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7
index 8d150ff..b2bded3 100644
--- a/dev/deps/spark-deps-hadoop-2.7
+++ b/dev/deps/spark-deps-hadoop-2.7
@@ -31,7 +31,7 @@ commons-beanutils-core-1.8.0.jar
 commons-cli-1.2.jar
 commons-codec-1.10.jar
 commons-collections-3.2.2.jar
-commons-compiler-3.0.0.jar
+commons-compiler-3.0.7.jar
 commons-compress-1.4.1.jar
 commons-configuration-1.6.jar
 commons-crypto-1.0.0.jar
@@ -90,7 +90,7 @@ jackson-mapper-asl-1.9.13.jar
 jackson-module-paranamer-2.6.5.jar
 jackson-module-scala_2.11-2.6.5.jar
 jackson-xc-1.9.13.jar
-janino-3.0.0.jar
+janino-3.0.7.jar
 java-xmlbuilder-1.0.jar
 javassist-3.18.1-GA.jar
 javax.annotation-api-1.2.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/4e3680f2/pom.xml
--
diff --git a/pom.xml b/pom.xml
index 85f3145..534e31a 100644
--- a/pom.xml
+++ b/pom.xml
@@ -170,7 +170,7 @@
 
 3.5
 3.2.10
-3.0.0
+3.0.7
 2.22.2
 2.9.3
 3.5.2





spark git commit: [SPARK-22373] Bump Janino dependency version to fix thread safety issue…

2017-11-30 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 d7b14746d -> 0121ebc83


[SPARK-22373] Bump Janino dependency version to fix thread safety issue…

… with Janino when compiling generated code.

## What changes were proposed in this pull request?

Bump up the Janino dependency version to fix a thread-safety issue when compiling
generated code.

## How was this patch tested?

Check https://issues.apache.org/jira/browse/SPARK-22373 for details.
Converted part of the code in CodeGenerator into a standalone application, so 
the issue can be consistently reproduced locally.
Verified that changing Janino dependency version resolved this issue.

Author: Min Shen 

Closes #19839 from Victsm/SPARK-22373.

(cherry picked from commit 7da1f5708cc96c18ddb3acd09542621275e71d83)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0121ebc8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0121ebc8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0121ebc8

Branch: refs/heads/branch-2.2
Commit: 0121ebc8312ccb8292a673d04f70a2fd56555412
Parents: d7b1474
Author: Min Shen 
Authored: Thu Nov 30 19:24:44 2017 -0600
Committer: Sean Owen 
Committed: Thu Nov 30 19:24:52 2017 -0600

--
 dev/deps/spark-deps-hadoop-2.6 | 4 ++--
 dev/deps/spark-deps-hadoop-2.7 | 4 ++--
 pom.xml| 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0121ebc8/dev/deps/spark-deps-hadoop-2.6
--
diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6
index 02c0b21..036d811 100644
--- a/dev/deps/spark-deps-hadoop-2.6
+++ b/dev/deps/spark-deps-hadoop-2.6
@@ -31,7 +31,7 @@ commons-beanutils-core-1.8.0.jar
 commons-cli-1.2.jar
 commons-codec-1.10.jar
 commons-collections-3.2.2.jar
-commons-compiler-3.0.0.jar
+commons-compiler-3.0.7.jar
 commons-compress-1.4.1.jar
 commons-configuration-1.6.jar
 commons-crypto-1.0.0.jar
@@ -90,7 +90,7 @@ jackson-mapper-asl-1.9.13.jar
 jackson-module-paranamer-2.6.5.jar
 jackson-module-scala_2.11-2.6.5.jar
 jackson-xc-1.9.13.jar
-janino-3.0.0.jar
+janino-3.0.7.jar
 java-xmlbuilder-1.0.jar
 javassist-3.18.1-GA.jar
 javax.annotation-api-1.2.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/0121ebc8/dev/deps/spark-deps-hadoop-2.7
--
diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7
index 47e28de..3c75273 100644
--- a/dev/deps/spark-deps-hadoop-2.7
+++ b/dev/deps/spark-deps-hadoop-2.7
@@ -31,7 +31,7 @@ commons-beanutils-core-1.8.0.jar
 commons-cli-1.2.jar
 commons-codec-1.10.jar
 commons-collections-3.2.2.jar
-commons-compiler-3.0.0.jar
+commons-compiler-3.0.7.jar
 commons-compress-1.4.1.jar
 commons-configuration-1.6.jar
 commons-crypto-1.0.0.jar
@@ -90,7 +90,7 @@ jackson-mapper-asl-1.9.13.jar
 jackson-module-paranamer-2.6.5.jar
 jackson-module-scala_2.11-2.6.5.jar
 jackson-xc-1.9.13.jar
-janino-3.0.0.jar
+janino-3.0.7.jar
 java-xmlbuilder-1.0.jar
 javassist-3.18.1-GA.jar
 javax.annotation-api-1.2.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/0121ebc8/pom.xml
--
diff --git a/pom.xml b/pom.xml
index c027f57..3908aab 100644
--- a/pom.xml
+++ b/pom.xml
@@ -170,7 +170,7 @@
 
 3.5
 3.2.10
-3.0.0
+3.0.7
 2.22.2
 2.9.3
 3.5.2





spark git commit: [SPARK-22428][DOC] Add spark application garbage collector configurat…

2017-11-30 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master f5f8e84d9 -> 7e5f669eb


[SPARK-22428][DOC] Add spark application garbage collector configurat…

## What changes were proposed in this pull request?

The Spark properties for configuring the ContextCleaner are not documented in the
official documentation at
https://spark.apache.org/docs/latest/configuration.html#available-properties.

This PR adds the documentation.
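For illustration, a hedged example of setting the newly documented properties when building
a session (the values are arbitrary; only the property names come from this patch):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("context-cleaner-config")
  .master("local[*]")
  .config("spark.cleaner.periodicGC.interval", "15min")
  .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
  .getOrCreate()
```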

## How was this patch tested?

Manual.

```
cd docs
jekyll build
open _site/configuration.html
```

Author: gaborgsomogyi 

Closes #19826 from gaborgsomogyi/SPARK-22428.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7e5f669e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7e5f669e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7e5f669e

Branch: refs/heads/master
Commit: 7e5f669eb684629c88218f8ec26c01a41a6fef32
Parents: f5f8e84
Author: gaborgsomogyi 
Authored: Thu Nov 30 19:20:32 2017 -0600
Committer: Sean Owen 
Committed: Thu Nov 30 19:20:32 2017 -0600

--
 docs/configuration.md | 40 
 1 file changed, 40 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7e5f669e/docs/configuration.md
--
diff --git a/docs/configuration.md b/docs/configuration.md
index e42f866..ef061dd 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1132,6 +1132,46 @@ Apart from these, the following properties are also 
available, and may be useful
 to get the replication level of the block to the initial number.
   
 
+
+  spark.cleaner.periodicGC.interval
+  30min
+  
+Controls how often to trigger a garbage collection.
+This context cleaner triggers cleanups only when weak references are garbage collected.
+In long-running applications with large driver JVMs, where there is little memory pressure
+on the driver, this may happen very occasionally or not at all. Not cleaning at all may
+lead to executors running out of disk space after a while.
+  
+
+
+  spark.cleaner.referenceTracking
+  true
+  
+Enables or disables context cleaning.
+  
+
+
+  spark.cleaner.referenceTracking.blocking
+  true
+  
+Controls whether the cleaning thread should block on cleanup tasks (other than shuffle, which is controlled by
+spark.cleaner.referenceTracking.blocking.shuffle Spark property).
+  
+
+
+  spark.cleaner.referenceTracking.blocking.shuffle
+  false
+  
+Controls whether the cleaning thread should block on shuffle cleanup tasks.
+  
+
+
+  spark.cleaner.referenceTracking.cleanCheckpoints
+  false
+  
+Controls whether to clean checkpoint files if the reference is out of scope.
+  
+
 
 
 ### Execution Behavior





spark git commit: [SPARK-22614] Dataset API: repartitionByRange(...)

2017-11-30 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master bcceab649 -> f5f8e84d9


[SPARK-22614] Dataset API: repartitionByRange(...)

## What changes were proposed in this pull request?

This PR introduces a way to explicitly range-partition a Dataset. So far, only
round-robin and hash partitioning were possible via `df.repartition(...)`, but
sometimes range partitioning might be desirable: e.g. when writing to disk, for
better compression without the cost of a global sort.

The current implementation piggybacks on the existing `RepartitionByExpression` 
`LogicalPlan` and simply adds the following logic: If its expressions are of 
type `SortOrder`, then it will do `RangePartitioning`; otherwise 
`HashPartitioning`. This was by far the least intrusive solution I could come 
up with.
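A hedged usage sketch of the new API (method name as in the PR title; run against a build
that includes this change):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("repartition-by-range").master("local[*]").getOrCreate()

val df = spark.range(0, 1000).toDF("id")
val hashed = df.repartition(8, col("id"))          // existing behavior: hash partitioning
val ranged = df.repartitionByRange(8, col("id"))   // new: each partition holds a contiguous id range
// Inspect how many rows landed in each partition:
ranged.rdd.glom().map(_.length).collect().foreach(println)
```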

## How was this patch tested?
A unit test for the `RepartitionByExpression` changes, a test to ensure the behavior of the
existing `.repartition()` is unchanged, and a few end-to-end tests in `DataFrameSuite`.

Author: Adrian Ionescu 

Closes #19828 from adrian-ionescu/repartitionByRange.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5f8e84d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5f8e84d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5f8e84d

Branch: refs/heads/master
Commit: f5f8e84d9d35751dad51490b6ae22931aa88db7b
Parents: bcceab6
Author: Adrian Ionescu 
Authored: Thu Nov 30 15:41:34 2017 -0800
Committer: gatorsmile 
Committed: Thu Nov 30 15:41:34 2017 -0800

--
 .../plans/logical/basicLogicalOperators.scala   | 20 +++
 .../sql/catalyst/analysis/AnalysisSuite.scala   | 26 +
 .../scala/org/apache/spark/sql/Dataset.scala| 57 ++--
 .../spark/sql/execution/SparkStrategies.scala   |  5 +-
 .../org/apache/spark/sql/DataFrameSuite.scala   | 57 
 5 files changed, 157 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f5f8e84d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
index c2750c3..93de7c1 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
@@ -23,6 +23,7 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical.statsEstimation._
+import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning, RangePartitioning}
 import org.apache.spark.sql.types._
 import org.apache.spark.util.Utils
 import org.apache.spark.util.random.RandomSampler
@@ -838,6 +839,25 @@ case class RepartitionByExpression(
 
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
 
+  val partitioning: Partitioning = {
+val (sortOrder, nonSortOrder) = partitionExpressions.partition(_.isInstanceOf[SortOrder])
+
+require(sortOrder.isEmpty || nonSortOrder.isEmpty,
+  s"${getClass.getSimpleName} expects that either all its 
`partitionExpressions` are of type " +
+"`SortOrder`, which means `RangePartitioning`, or none of them are 
`SortOrder`, which " +
+"means `HashPartitioning`. In this case we have:" +
+  s"""
+ |SortOrder: ${sortOrder}
+ |NonSortOrder: ${nonSortOrder}
+   """.stripMargin)
+
+if (sortOrder.nonEmpty) {
+  RangePartitioning(sortOrder.map(_.asInstanceOf[SortOrder]), numPartitions)
+} else {
+  HashPartitioning(nonSortOrder, numPartitions)
+}
+  }
+
   override def maxRows: Option[Long] = child.maxRows
   override def shuffle: Boolean = true
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/f5f8e84d/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala
index e56a5d6..0e2e706 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala
@@ 

spark git commit: [SPARK-22489][SQL] Shouldn't change broadcast join buildSide if user clearly specified

2017-11-30 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master 6ac57fd0d -> bcceab649


[SPARK-22489][SQL] Shouldn't change broadcast join buildSide if user clearly 
specified

## What changes were proposed in this pull request?

How to reproduce:
```scala
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
"value").createTempView("table1")
spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
"value").createTempView("table2")

val bl = sql("SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON 
t1.key = t2.key").queryExecution.executedPlan

println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide)
```
The result is `BuildRight`, but it should be `BuildLeft`. This PR fixes this issue.
## How was this patch tested?

unit tests

Author: Yuming Wang 

Closes #19714 from wangyum/SPARK-22489.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bcceab64
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bcceab64
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bcceab64

Branch: refs/heads/master
Commit: bcceab649510a45f4c4b8e44b157c9987adff6f4
Parents: 6ac57fd
Author: Yuming Wang 
Authored: Thu Nov 30 15:36:26 2017 -0800
Committer: gatorsmile 
Committed: Thu Nov 30 15:36:26 2017 -0800

--
 docs/sql-programming-guide.md   | 58 
 .../spark/sql/execution/SparkStrategies.scala   | 67 ++-
 .../execution/joins/BroadcastJoinSuite.scala| 69 +++-
 3 files changed, 177 insertions(+), 17 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bcceab64/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 983770d..a1b9c3b 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1492,6 +1492,64 @@ that these options will be deprecated in future release 
as more optimizations ar
   
 
 
+## Broadcast Hint for SQL Queries
+
+The `BROADCAST` hint guides Spark to broadcast each specified table when joining them with another table or view.
+When Spark deciding the join methods, the broadcast hash join (i.e., BHJ) is preferred,
+even if the statistics is above the configuration `spark.sql.autoBroadcastJoinThreshold`.
+When both sides of a join are specified, Spark broadcasts the one having the lower statistics.
+Note Spark does not guarantee BHJ is always chosen, since not all cases (e.g. full outer join)
+support BHJ. When the broadcast nested loop join is selected, we still respect the hint.
+
+
+
+
+
+{% highlight scala %}
+import org.apache.spark.sql.functions.broadcast
+broadcast(spark.table("src")).join(spark.table("records"), "key").show()
+{% endhighlight %}
+
+
+
+
+
+{% highlight java %}
+import static org.apache.spark.sql.functions.broadcast;
+broadcast(spark.table("src")).join(spark.table("records"), "key").show();
+{% endhighlight %}
+
+
+
+
+
+{% highlight python %}
+from pyspark.sql.functions import broadcast
+broadcast(spark.table("src")).join(spark.table("records"), "key").show()
+{% endhighlight %}
+
+
+
+
+
+{% highlight r %}
+src <- sql("SELECT * FROM src")
+records <- sql("SELECT * FROM records")
+head(join(broadcast(src), records, src$key == records$key))
+{% endhighlight %}
+
+
+
+
+
+{% highlight sql %}
+-- We accept BROADCAST, BROADCASTJOIN and MAPJOIN for broadcast hint
+SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.key = s.key
+{% endhighlight %}
+
+
+
+
 # Distributed SQL Engine
 
 Spark SQL can also act as a distributed query engine using its JDBC/ODBC or 
command-line interface.

http://git-wip-us.apache.org/repos/asf/spark/blob/bcceab64/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
index 19b858f..1fe3cb1 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
@@ -29,7 +29,7 @@ import org.apache.spark.sql.catalyst.plans.physical._
 import org.apache.spark.sql.execution.columnar.{InMemoryRelation, 
InMemoryTableScanExec}
 import org.apache.spark.sql.execution.command._
 import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
-import org.apache.spark.sql.execution.joins.{BuildLeft, BuildRight}
+import org.apache.spark.sql.execution.joins.{BuildLeft, BuildRight, BuildSide}
 import 

spark git commit: [SPARK-21417][SQL] Infer join conditions using propagated constraints

2017-11-30 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master 999ec137a -> 6ac57fd0d


[SPARK-21417][SQL] Infer join conditions using propagated constraints

## What changes were proposed in this pull request?

This PR adds an optimization rule that infers join conditions using propagated 
constraints.

For instance, if there is a join where the left relation has 'a = 1' and the right
relation has 'b = 1', then the rule infers 'a = b' as a join predicate. Only
semantically new predicates are appended to the existing join condition.

Refer to the corresponding ticket and tests for more details.
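A hedged illustration of the inference, assuming a SparkSession named `spark` (table and
column names are made up for the example):

```scala
spark.range(10).toDF("a").createOrReplaceTempView("t1")
spark.range(10).toDF("b").createOrReplaceTempView("t2")

// Each side constrains its column to the same constant:
val q = spark.sql("SELECT * FROM t1 JOIN t2 ON t1.a = 1 AND t2.b = 1")
// Without an equi-join condition this plans as a cartesian product (rejected by default
// unless spark.sql.crossJoin.enabled=true); with this rule the optimizer infers t1.a = t2.b
// from the propagated constraints, so the join can execute as a regular equi-join.
q.explain(true)
```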

## How was this patch tested?

This patch comes with a new test suite to cover the implemented logic.

Author: aokolnychyi 

Closes #18692 from aokolnychyi/spark-21417.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6ac57fd0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6ac57fd0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6ac57fd0

Branch: refs/heads/master
Commit: 6ac57fd0d1c82b834eb4bf0dd57596b92a99d6de
Parents: 999ec13
Author: aokolnychyi 
Authored: Thu Nov 30 14:25:10 2017 -0800
Committer: gatorsmile 
Committed: Thu Nov 30 14:25:10 2017 -0800

--
 .../expressions/EquivalentExpressionMap.scala   |  66 +
 .../catalyst/expressions/ExpressionSet.scala|   2 +
 .../sql/catalyst/optimizer/Optimizer.scala  |   1 +
 .../spark/sql/catalyst/optimizer/joins.scala|  60 +
 .../EquivalentExpressionMapSuite.scala  |  56 +
 .../optimizer/EliminateCrossJoinSuite.scala | 238 +++
 6 files changed, 423 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6ac57fd0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressionMap.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressionMap.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressionMap.scala
new file mode 100644
index 000..cf1614a
--- /dev/null
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressionMap.scala
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import scala.collection.mutable
+
+import 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressionMap.SemanticallyEqualExpr
+
+/**
+ * A class that allows you to map an expression into a set of equivalent 
expressions. The keys are
+ * handled based on their semantic meaning and ignoring cosmetic differences. 
The values are
+ * represented as [[ExpressionSet]]s.
+ *
+ * The underlying representation of keys depends on the 
[[Expression.semanticHash]] and
+ * [[Expression.semanticEquals]] methods.
+ *
+ * {{{
+ *   val map = new EquivalentExpressionMap()
+ *
+ *   map.put(1 + 2, a)
+ *   map.put(rand(), b)
+ *
+ *   map.get(2 + 1) => Set(a) // 1 + 2 and 2 + 1 are semantically equivalent
+ *   map.get(1 + 2) => Set(a) // 1 + 2 and 2 + 1 are semantically equivalent
+ *   map.get(rand()) => Set() // non-deterministic expressions are not 
equivalent
+ * }}}
+ */
+class EquivalentExpressionMap {
+
+  private val equivalenceMap = mutable.HashMap.empty[SemanticallyEqualExpr, 
ExpressionSet]
+
+  def put(expression: Expression, equivalentExpression: Expression): Unit = {
+val equivalentExpressions = equivalenceMap.getOrElseUpdate(expression, 
ExpressionSet.empty)
+equivalenceMap(expression) = equivalentExpressions + equivalentExpression
+  }
+
+  def get(expression: Expression): Set[Expression] =
+equivalenceMap.getOrElse(expression, ExpressionSet.empty)
+}
+
+object EquivalentExpressionMap {
+
+  private implicit class SemanticallyEqualExpr(val expr: Expression) {
+override def equals(obj: Any): Boolean = obj match {
+  case other: SemanticallyEqualExpr => 

spark git commit: [SPARK-22570][SQL] Avoid to create a lot of global variables by using a local variable with allocation of an object in generated code

2017-11-30 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 932bd09c8 -> 999ec137a


[SPARK-22570][SQL] Avoid to create a lot of global variables by using a local 
variable with allocation of an object in generated code

## What changes were proposed in this pull request?

This PR reduces the number of global variables in generated code by replacing a global
variable with a local variable that allocates an object each time it is used. When many
global variables are generated, the generated code may hit the 64K constant pool limit.
This PR reduces the number of generated global variables in the following three operations:
* `Cast` with String to primitive byte/short/int/long
* `RegExpReplace`
* `CreateArray`

I intentionally leave [this part](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L595-L603)
unchanged, because that variable holds a dynamically generated class; in other words, it is
not possible to reuse a single class there.
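To make the pattern concrete, a hedged, self-contained sketch of the two codegen shapes
(plain strings standing in for what CodegenContext emits; not Spark's actual generator):

```scala
object CodegenPatternSketch {
  // Before: one mutable-state field per expression instance, added to the generated class,
  // which grows the class (and eventually its constant pool) with the number of expressions.
  def withGlobalField(id: Int): String =
    s"""private UTF8String.IntWrapper wrapper$id = new UTF8String.IntWrapper();
       |if (input.toInt(wrapper$id)) { value = wrapper$id.value; } else { isNull = true; }
       |""".stripMargin

  // After: a throwaway local allocated where it is used and nulled right after, so wide
  // plans add no per-expression class fields.
  def withLocalVariable(): String =
    """UTF8String.IntWrapper intWrapper = new UTF8String.IntWrapper();
      |if (input.toInt(intWrapper)) { value = intWrapper.value; } else { isNull = true; }
      |intWrapper = null;
      |""".stripMargin

  def main(args: Array[String]): Unit = {
    println(withGlobalField(1))
    println(withLocalVariable())
  }
}
```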

## How was this patch tested?

Added test cases

Author: Kazuaki Ishizaki 

Closes #19797 from kiszk/SPARK-22570.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/999ec137
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/999ec137
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/999ec137

Branch: refs/heads/master
Commit: 999ec137a97844abbbd483dd98c7ded2f8ff356c
Parents: 932bd09
Author: Kazuaki Ishizaki 
Authored: Fri Dec 1 02:28:24 2017 +0800
Committer: Wenchen Fan 
Committed: Fri Dec 1 02:28:24 2017 +0800

--
 .../spark/sql/catalyst/expressions/Cast.scala   | 24 ++---
 .../expressions/complexTypeCreator.scala| 36 
 .../expressions/regexpExpressions.scala |  8 ++---
 .../sql/catalyst/expressions/CastSuite.scala|  8 +
 .../expressions/RegexpExpressionsSuite.scala| 11 +-
 .../catalyst/optimizer/complexTypesSuite.scala  |  7 
 6 files changed, 61 insertions(+), 33 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/999ec137/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
index 8cafaef..f4ecbdb 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
@@ -799,16 +799,16 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
 
   private[this] def castToByteCode(from: DataType, ctx: CodegenContext): CastFunction = from match {
 case StringType =>
-  val wrapper = ctx.freshName("wrapper")
-  ctx.addMutableState("UTF8String.IntWrapper", wrapper,
-s"$wrapper = new UTF8String.IntWrapper();")
+  val wrapper = ctx.freshName("intWrapper")
   (c, evPrim, evNull) =>
 s"""
+  UTF8String.IntWrapper $wrapper = new UTF8String.IntWrapper();
   if ($c.toByte($wrapper)) {
 $evPrim = (byte) $wrapper.value;
   } else {
 $evNull = true;
   }
+  $wrapper = null;
 """
 case BooleanType =>
   (c, evPrim, evNull) => s"$evPrim = $c ? (byte) 1 : (byte) 0;"
@@ -826,16 +826,16 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
   from: DataType,
   ctx: CodegenContext): CastFunction = from match {
 case StringType =>
-  val wrapper = ctx.freshName("wrapper")
-  ctx.addMutableState("UTF8String.IntWrapper", wrapper,
-s"$wrapper = new UTF8String.IntWrapper();")
+  val wrapper = ctx.freshName("intWrapper")
   (c, evPrim, evNull) =>
 s"""
+  UTF8String.IntWrapper $wrapper = new UTF8String.IntWrapper();
   if ($c.toShort($wrapper)) {
 $evPrim = (short) $wrapper.value;
   } else {
 $evNull = true;
   }
+  $wrapper = null;
 """
 case BooleanType =>
   (c, evPrim, evNull) => s"$evPrim = $c ? (short) 1 : (short) 0;"
@@ -851,16 +851,16 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
 
   private[this] def castToIntCode(from: DataType, ctx: CodegenContext): CastFunction = from match {
 case StringType =>
-  val wrapper = ctx.freshName("wrapper")
-  ctx.addMutableState("UTF8String.IntWrapper", wrapper,
-s"$wrapper = new UTF8String.IntWrapper();")
+  val wrapper = ctx.freshName("intWrapper")
   (c, evPrim, evNull) =>
 s"""
+  

spark git commit: [SPARK-22635][SQL][ORC] FileNotFoundException while reading ORC files containing special characters

2017-11-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6eb203fae -> 932bd09c8


[SPARK-22635][SQL][ORC] FileNotFoundException while reading ORC files 
containing special characters

## What changes were proposed in this pull request?

SPARK-22146 fixed the FileNotFoundException issue only for the `inferSchema` method, i.e.
only for schema inference, but it did not fix the problem when actually reading the data,
so nearly the same exception occurs when someone tries to use the data. This PR fixes the
problem there as well.
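A hedged reproduction sketch, assuming a SparkSession named `spark` with ORC support (the
directory name is made up and only needs to contain URI-special characters):

```scala
val dir = "/tmp/orc test %dir"                      // space and percent sign in the path
spark.range(10).write.format("orc").save(dir)

// Before this patch, schema inference already worked (SPARK-22146), but the read below
// could still fail with FileNotFoundException because the escaped file path was reused as-is:
spark.read.format("orc").load(dir).show()
```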

## How was this patch tested?

Enhanced an existing unit test.

Author: Marco Gaido 

Closes #19844 from mgaido91/SPARK-22635.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/932bd09c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/932bd09c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/932bd09c

Branch: refs/heads/master
Commit: 932bd09c80dc2dc113e94f59f4dcb77e77de7c58
Parents: 6eb203f
Author: Marco Gaido 
Authored: Fri Dec 1 01:24:15 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Dec 1 01:24:15 2017 +0900

--
 .../org/apache/spark/sql/hive/orc/OrcFileFormat.scala| 11 +--
 .../spark/sql/hive/MetastoreDataSourcesSuite.scala   |  3 ++-
 2 files changed, 7 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/932bd09c/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
--
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
index 3b33a9f..95741c7 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
@@ -133,10 +133,12 @@ class OrcFileFormat extends FileFormat with 
DataSourceRegister with Serializable
 (file: PartitionedFile) => {
   val conf = broadcastedHadoopConf.value.value
 
+  val filePath = new Path(new URI(file.filePath))
+
   // SPARK-8501: Empty ORC files always have an empty schema stored in their footer. In this
   // case, `OrcFileOperator.readSchema` returns `None`, and we can't read the underlying file
   // using the given physical schema. Instead, we simply return an empty iterator.
-  val isEmptyFile = OrcFileOperator.readSchema(Seq(file.filePath), Some(conf)).isEmpty
+  val isEmptyFile = OrcFileOperator.readSchema(Seq(filePath.toString), Some(conf)).isEmpty
   if (isEmptyFile) {
 Iterator.empty
   } else {
@@ -146,15 +148,12 @@ class OrcFileFormat extends FileFormat with 
DataSourceRegister with Serializable
   val job = Job.getInstance(conf)
   FileInputFormat.setInputPaths(job, file.filePath)
 
-  val fileSplit = new FileSplit(
-    new Path(new URI(file.filePath)), file.start, file.length, Array.empty
-  )
+  val fileSplit = new FileSplit(filePath, file.start, file.length, Array.empty)
   // Custom OrcRecordReader is used to get
   // ObjectInspector during recordReader creation itself and can
   // avoid NameNode call in unwrapOrcStructs per file.
   // Specifically would be helpful for partitioned datasets.
-  val orcReader = OrcFile.createReader(
-new Path(new URI(file.filePath)), OrcFile.readerOptions(conf))
+  val orcReader = OrcFile.createReader(filePath, OrcFile.readerOptions(conf))
   new SparkOrcNewRecordReader(orcReader, conf, fileSplit.getStart, fileSplit.getLength)
 }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/932bd09c/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
index a106047..c8caba8 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
@@ -1350,7 +1350,8 @@ class MetastoreDataSourcesSuite extends QueryTest with 
SQLTestUtils with TestHiv
   withTempDir { dir =>
 val tmpFile = s"$dir/$nameWithSpecialChars"
 spark.createDataset(Seq("a", "b")).write.format(format).save(tmpFile)
-spark.read.format(format).load(tmpFile)
+val fileContent = spark.read.format(format).load(tmpFile)
+checkAnswer(fileContent, Seq(Row("a"), Row("b")))
   }
 }
   }



spark git commit: [SPARK-22654][TESTS] Retry Spark tarball download if failed in HiveExternalCatalogVersionsSuite

2017-11-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 38a0532cf -> d7b14746d


[SPARK-22654][TESTS] Retry Spark tarball download if failed in 
HiveExternalCatalogVersionsSuite

## What changes were proposed in this pull request?

Adds a simple loop to retry download of Spark tarballs from different mirrors 
if the download fails.

## How was this patch tested?

Existing tests

Author: Sean Owen 

Closes #19851 from srowen/SPARK-22654.

(cherry picked from commit 6eb203fae7bbc9940710da40f314b89ffb4dd324)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d7b14746
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d7b14746
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d7b14746

Branch: refs/heads/branch-2.2
Commit: d7b14746dd9bd488240174446bd158be1e30c250
Parents: 38a0532
Author: Sean Owen 
Authored: Fri Dec 1 01:21:52 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Dec 1 01:22:06 2017 +0900

--
 .../hive/HiveExternalCatalogVersionsSuite.scala | 24 +++-
 1 file changed, 18 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d7b14746/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
index 6859432..a3d5b94 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
@@ -20,6 +20,8 @@ package org.apache.spark.sql.hive
 import java.io.File
 import java.nio.file.Files
 
+import scala.sys.process._
+
 import org.apache.spark.TestUtils
 import org.apache.spark.sql.{QueryTest, Row, SparkSession}
 import org.apache.spark.sql.catalyst.TableIdentifier
@@ -50,14 +52,24 @@ class HiveExternalCatalogVersionsSuite extends 
SparkSubmitTestUtils {
 super.afterAll()
   }
 
-  private def downloadSpark(version: String): Unit = {
-import scala.sys.process._
+  private def tryDownloadSpark(version: String, path: String): Unit = {
+// Try mirrors a few times until one succeeds
+for (i <- 0 until 3) {
+  val preferredMirror =
+Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true;, 
"-q", "-O", "-").!!.trim
+  val url = 
s"$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
+  logInfo(s"Downloading Spark $version from $url")
+  if (Seq("wget", url, "-q", "-P", path).! == 0) {
+return
+  }
+  logWarning(s"Failed to download Spark $version from $url")
+}
+fail(s"Unable to download Spark $version")
+  }
 
-val preferredMirror =
-  Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true;, 
"-q", "-O", "-").!!.trim
-val url = 
s"$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
 
-Seq("wget", url, "-q", "-P", sparkTestingDir.getCanonicalPath).!
+  private def downloadSpark(version: String): Unit = {
+tryDownloadSpark(version, sparkTestingDir.getCanonicalPath)
 
  val downloaded = new File(sparkTestingDir, s"spark-$version-bin-hadoop2.7.tgz").getCanonicalPath
  val targetDir = new File(sparkTestingDir, s"spark-$version").getCanonicalPath





spark git commit: [SPARK-22654][TESTS] Retry Spark tarball download if failed in HiveExternalCatalogVersionsSuite

2017-11-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 9c29c5576 -> 6eb203fae


[SPARK-22654][TESTS] Retry Spark tarball download if failed in 
HiveExternalCatalogVersionsSuite

## What changes were proposed in this pull request?

Adds a simple loop to retry download of Spark tarballs from different mirrors 
if the download fails.

## How was this patch tested?

Existing tests

Author: Sean Owen 

Closes #19851 from srowen/SPARK-22654.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6eb203fa
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6eb203fa
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6eb203fa

Branch: refs/heads/master
Commit: 6eb203fae7bbc9940710da40f314b89ffb4dd324
Parents: 9c29c55
Author: Sean Owen 
Authored: Fri Dec 1 01:21:52 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Dec 1 01:21:52 2017 +0900

--
 .../hive/HiveExternalCatalogVersionsSuite.scala | 24 +++-
 1 file changed, 18 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6eb203fa/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
index 6859432..a3d5b94 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
@@ -20,6 +20,8 @@ package org.apache.spark.sql.hive
 import java.io.File
 import java.nio.file.Files
 
+import scala.sys.process._
+
 import org.apache.spark.TestUtils
 import org.apache.spark.sql.{QueryTest, Row, SparkSession}
 import org.apache.spark.sql.catalyst.TableIdentifier
@@ -50,14 +52,24 @@ class HiveExternalCatalogVersionsSuite extends 
SparkSubmitTestUtils {
 super.afterAll()
   }
 
-  private def downloadSpark(version: String): Unit = {
-import scala.sys.process._
+  private def tryDownloadSpark(version: String, path: String): Unit = {
+// Try mirrors a few times until one succeeds
+for (i <- 0 until 3) {
+  val preferredMirror =
+Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true;, 
"-q", "-O", "-").!!.trim
+  val url = 
s"$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
+  logInfo(s"Downloading Spark $version from $url")
+  if (Seq("wget", url, "-q", "-P", path).! == 0) {
+return
+  }
+  logWarning(s"Failed to download Spark $version from $url")
+}
+fail(s"Unable to download Spark $version")
+  }
 
-val preferredMirror =
-  Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true;, 
"-q", "-O", "-").!!.trim
-val url = 
s"$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
 
-Seq("wget", url, "-q", "-P", sparkTestingDir.getCanonicalPath).!
+  private def downloadSpark(version: String): Unit = {
+tryDownloadSpark(version, sparkTestingDir.getCanonicalPath)
 
  val downloaded = new File(sparkTestingDir, s"spark-$version-bin-hadoop2.7.tgz").getCanonicalPath
  val targetDir = new File(sparkTestingDir, s"spark-$version").getCanonicalPath





spark git commit: [SPARK-22643][SQL] ColumnarArray should be an immutable view

2017-11-30 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 444a2bbb6 -> 9c29c5576


[SPARK-22643][SQL] ColumnarArray should be an immutable view

## What changes were proposed in this pull request?

To make `ColumnVector` public, `ColumnarArray` needs to be public too, and we 
should not have mutable public fields in a public class. This PR proposes to 
make `ColumnarArray` an immutable view of the data, and to always create a new 
instance of `ColumnarArray` in `ColumnVector#getArray`.
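
The view-based contract can be sketched independently of Spark's classes. In
the simplified Scala below, `IntVector` and `IntArrayView` are illustrative
stand-ins, not Spark's API; the point is that `getArray` always returns a
fresh object that only records the backing vector, an offset, and a length, so
the view copies nothing and exposes no mutators:

```scala
// Simplified "immutable view over a column vector" sketch (assumed names,
// not Spark's ColumnVector/ColumnarArray).
class IntVector(private val data: Array[Int]) {
  def get(i: Int): Int = data(i)

  // Always hand out a fresh, read-only view instead of reusing a mutable one.
  def getArray(offset: Int, length: Int): IntArrayView =
    new IntArrayView(this, offset, length)
}

final class IntArrayView(vector: IntVector, offset: Int, length: Int) {
  def size: Int = length
  def apply(i: Int): Int = vector.get(offset + i)
  def toSeq: Seq[Int] = (0 until length).map(apply)
}

// Usage: views over the same backing vector are independent and read-only.
val vec = new IntVector(Array(1, 2, 3, 4, 5))
val view = vec.getArray(offset = 1, length = 3)
assert(view.toSeq == Seq(2, 3, 4))
```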

## How was this patch tested?

new benchmark in `ColumnarBatchBenchmark`

Author: Wenchen Fan 

Closes #19842 from cloud-fan/column-vector.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9c29c557
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9c29c557
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9c29c557

Branch: refs/heads/master
Commit: 9c29c557635caf739fde942f53255273aac0d7b1
Parents: 444a2bb
Author: Wenchen Fan 
Authored: Thu Nov 30 18:34:38 2017 +0800
Committer: Wenchen Fan 
Committed: Thu Nov 30 18:34:38 2017 +0800

--
 .../parquet/VectorizedColumnReader.java |   2 +-
 .../execution/vectorized/ArrowColumnVector.java |   1 -
 .../sql/execution/vectorized/ColumnVector.java  |  14 +-
 .../execution/vectorized/ColumnVectorUtils.java |  18 +--
 .../sql/execution/vectorized/ColumnarArray.java |  10 +-
 .../vectorized/OffHeapColumnVector.java |   2 +-
 .../vectorized/OnHeapColumnVector.java  |   2 +-
 .../vectorized/WritableColumnVector.java|  13 +-
 .../vectorized/ColumnVectorSuite.scala  |  41 +++---
 .../vectorized/ColumnarBatchBenchmark.scala | 142 +++
 .../vectorized/ColumnarBatchSuite.scala |  18 +--
 11 files changed, 164 insertions(+), 99 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9c29c557/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
--
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
index 0f1f470..71ca8b1 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
@@ -425,7 +425,7 @@ public class VectorizedColumnReader {
 // This is where we implement support for the valid type conversions.
 // TODO: implement remaining type conversions
 VectorizedValuesReader data = (VectorizedValuesReader) dataColumn;
-if (column.isArray()) {
+if (column.dataType() == DataTypes.StringType || column.dataType() == DataTypes.BinaryType) {
   defColumn.readBinarys(num, column, rowId, maxDefLevel, data);
 } else if (column.dataType() == DataTypes.TimestampType) {
   for (int i = 0; i < num; i++) {

http://git-wip-us.apache.org/repos/asf/spark/blob/9c29c557/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java
--
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java
index 5c502c9..0071bd6 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java
@@ -315,7 +315,6 @@ public final class ArrowColumnVector extends ColumnVector {
 
   childColumns = new ArrowColumnVector[1];
   childColumns[0] = new ArrowColumnVector(listVector.getDataVector());
-  resultArray = new ColumnarArray(childColumns[0]);
 } else if (vector instanceof MapVector) {
   MapVector mapVector = (MapVector) vector;
   accessor = new StructAccessor(mapVector);

http://git-wip-us.apache.org/repos/asf/spark/blob/9c29c557/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
--
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
index 940457f..cca1491 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
@@ -175,9 +175,7 @@ public abstract class 

spark git commit: [SPARK-22652][SQL] remove set methods in ColumnarRow

2017-11-30 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 92cfbeeb5 -> 444a2bbb6


[SPARK-22652][SQL] remove set methods in ColumnarRow

## What changes were proposed in this pull request?

As a step to make `ColumnVector` public, the `ColumnarRow` returned by 
`ColumnVector#getStruct` should be immutable.

However, we still need the mutability of `ColumnarRow` for the fast vectorized 
hashmap in hash aggregate. To solve this, this PR introduces a 
`MutableColumnarRow` for that use case.
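
A simplified sketch of that split, with illustrative names (`ReadOnlyRow`,
`ColumnBackedRow`, `MutableColumnBackedRow`) rather than Spark's classes: the
publicly visible type exposes only getters, while the writable variant used by
the fast hash-aggregate path adds setters over the same columnar storage:

```scala
// Read-only contract exposed to consumers.
trait ReadOnlyRow {
  def getInt(ordinal: Int): Int
}

// Getter-only row backed by columnar storage (one Array[Int] per column).
final class ColumnBackedRow(columns: Array[Array[Int]], var rowId: Int) extends ReadOnlyRow {
  override def getInt(ordinal: Int): Int = columns(ordinal)(rowId)
}

// Internal writable variant: same columnar layout, plus setters; it is never
// handed out through the read-only interface.
final class MutableColumnBackedRow(columns: Array[Array[Int]], var rowId: Int) extends ReadOnlyRow {
  override def getInt(ordinal: Int): Int = columns(ordinal)(rowId)
  def setInt(ordinal: Int, value: Int): Unit = columns(ordinal)(rowId) = value
}

// Usage: the aggregation buffer updates in place via the mutable row, while
// downstream readers only ever see the read-only view over the same storage.
val cols = Array(Array.fill(4)(0))
val buffer = new MutableColumnBackedRow(cols, rowId = 0)
buffer.setInt(0, buffer.getInt(0) + 42)
val view: ReadOnlyRow = new ColumnBackedRow(cols, rowId = 0)
assert(view.getInt(0) == 42)
```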

## How was this patch tested?

Existing tests.

Author: Wenchen Fan 

Closes #19847 from cloud-fan/mutable-row.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/444a2bbb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/444a2bbb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/444a2bbb

Branch: refs/heads/master
Commit: 444a2bbb67c2548d121152bc922b4c3337ddc8e8
Parents: 92cfbee
Author: Wenchen Fan 
Authored: Thu Nov 30 18:28:58 2017 +0800
Committer: Wenchen Fan 
Committed: Thu Nov 30 18:28:58 2017 +0800

--
 .../sql/execution/vectorized/ColumnarRow.java   | 102 +--
 .../vectorized/MutableColumnarRow.java  | 278 +++
 .../execution/aggregate/HashAggregateExec.scala |   3 +-
 .../aggregate/VectorizedHashMapGenerator.scala  |  82 +++---
 .../vectorized/ColumnVectorSuite.scala  |  12 +
 .../vectorized/ColumnarBatchSuite.scala |  23 --
 6 files changed, 336 insertions(+), 164 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/444a2bbb/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarRow.java
--
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarRow.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarRow.java
index 98a9073..cabb747 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarRow.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarRow.java
@@ -16,8 +16,6 @@
  */
 package org.apache.spark.sql.execution.vectorized;
 
-import java.math.BigDecimal;
-
 import org.apache.spark.sql.catalyst.InternalRow;
 import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
 import org.apache.spark.sql.catalyst.util.MapData;
@@ -32,17 +30,10 @@ import org.apache.spark.unsafe.types.UTF8String;
 public final class ColumnarRow extends InternalRow {
   protected int rowId;
   private final ColumnVector[] columns;
-  private final WritableColumnVector[] writableColumns;
 
   // Ctor used if this is a struct.
   ColumnarRow(ColumnVector[] columns) {
 this.columns = columns;
-this.writableColumns = new WritableColumnVector[this.columns.length];
-for (int i = 0; i < this.columns.length; i++) {
-  if (this.columns[i] instanceof WritableColumnVector) {
-this.writableColumns[i] = (WritableColumnVector) this.columns[i];
-  }
-}
   }
 
   public ColumnVector[] columns() { return columns; }
@@ -205,97 +196,8 @@ public final class ColumnarRow extends InternalRow {
   }
 
   @Override
-  public void update(int ordinal, Object value) {
-if (value == null) {
-  setNullAt(ordinal);
-} else {
-  DataType dt = columns[ordinal].dataType();
-  if (dt instanceof BooleanType) {
-setBoolean(ordinal, (boolean) value);
-  } else if (dt instanceof IntegerType) {
-setInt(ordinal, (int) value);
-  } else if (dt instanceof ShortType) {
-setShort(ordinal, (short) value);
-  } else if (dt instanceof LongType) {
-setLong(ordinal, (long) value);
-  } else if (dt instanceof FloatType) {
-setFloat(ordinal, (float) value);
-  } else if (dt instanceof DoubleType) {
-setDouble(ordinal, (double) value);
-  } else if (dt instanceof DecimalType) {
-DecimalType t = (DecimalType) dt;
-setDecimal(ordinal, Decimal.apply((BigDecimal) value, t.precision(), t.scale()),
-  t.precision());
-  } else {
-throw new UnsupportedOperationException("Datatype not supported " + dt);
-  }
-}
-  }
-
-  @Override
-  public void setNullAt(int ordinal) {
-getWritableColumn(ordinal).putNull(rowId);
-  }
-
-  @Override
-  public void setBoolean(int ordinal, boolean value) {
-WritableColumnVector column = getWritableColumn(ordinal);
-column.putNotNull(rowId);
-column.putBoolean(rowId, value);
-  }
+  public void update(int ordinal, Object value) { throw new UnsupportedOperationException(); }
 
   @Override
-  public void setByte(int ordinal, byte value) {
-WritableColumnVector column = getWritableColumn(ordinal);
-column.putNotNull(rowId);
-