spark git commit: [SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 cache

2016-06-09 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 96c011d5b -> 8ee93eed9


[SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 
cache

This patch fixes a bug in `./dev/test-dependencies.sh` that caused spurious
failures when the script was run on a machine with an empty `.m2` cache. The
problem was that extra log output from the dependency download conflicted with
the grep pattern used to identify the classpath in the Maven output; the fix
adjusts that pattern to key off the "Dependencies classpath:" marker instead.

Tested manually with the following reproduction of the bug:

```
rm -rf ~/.m2/repository/org/apache/commons/
./dev/test-dependencies.sh
```
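
As an aside, the extraction that the adjusted pipeline performs can be sketched
in Scala roughly as follows (the function and variable names are hypothetical
and not part of the patch):

```
// Sketch: find the line after the "Dependencies classpath:" marker, split it on
// ':', keep only the jar file names, drop Spark's own artifacts, and sort.
def dependencyManifest(mvnOutput: String): Seq[String] = {
  val lines  = mvnOutput.split("\n").toVector
  val marker = lines.lastIndexWhere(_.contains("Dependencies classpath:"))
  if (marker < 0 || marker + 1 >= lines.length) Seq.empty
  else lines(marker + 1)
    .split(":")                       // one classpath entry per element
    .map(_.split("/").last)           // keep only the jar file name
    .filterNot(_.contains("spark"))   // exclude Spark's own modules
    .sorted
    .toSeq
}
```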

Author: Josh Rosen 

Closes #13568 from JoshRosen/SPARK-12712.

(cherry picked from commit 921fa40b14082bfd1094fa49fb3b0c46a79c1aaa)
Signed-off-by: Josh Rosen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8ee93eed
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8ee93eed
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8ee93eed

Branch: refs/heads/branch-2.0
Commit: 8ee93eed931b185b887882cc77c6fe8ddc907611
Parents: 96c011d
Author: Josh Rosen 
Authored: Thu Jun 9 00:51:24 2016 -0700
Committer: Josh Rosen 
Committed: Thu Jun 9 00:51:51 2016 -0700

--
 dev/test-dependencies.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8ee93eed/dev/test-dependencies.sh
--
diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh
index 5ea643e..28e3d4d 100755
--- a/dev/test-dependencies.sh
+++ b/dev/test-dependencies.sh
@@ -79,7 +79,7 @@ for HADOOP_PROFILE in "${HADOOP_PROFILES[@]}"; do
   echo "Generating dependency manifest for $HADOOP_PROFILE"
   mkdir -p dev/pr-deps
   $MVN $HADOOP2_MODULE_PROFILES -P$HADOOP_PROFILE dependency:build-classpath 
-pl assembly \
-| grep "Building Spark Project Assembly" -A 5 \
+| grep "Dependencies classpath:" -A 1 \
 | tail -n 1 | tr ":" "\n" | rev | cut -d "/" -f 1 | rev | sort \
 | grep -v spark > dev/pr-deps/spark-deps-$HADOOP_PROFILE
 done





spark git commit: [SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 cache

2016-06-09 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/master d5807def1 -> 921fa40b1


[SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 
cache

This patch fixes a bug in `./dev/test-dependencies.sh` that caused spurious
failures when the script was run on a machine with an empty `.m2` cache. The
problem was that extra log output from the dependency download conflicted with
the grep pattern used to identify the classpath in the Maven output; the fix
adjusts that pattern to key off the "Dependencies classpath:" marker instead.

Tested manually with the following reproduction of the bug:

```
rm -rf ~/.m2/repository/org/apache/commons/
./dev/test-dependencies.sh
```

Author: Josh Rosen 

Closes #13568 from JoshRosen/SPARK-12712.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/921fa40b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/921fa40b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/921fa40b

Branch: refs/heads/master
Commit: 921fa40b14082bfd1094fa49fb3b0c46a79c1aaa
Parents: d5807de
Author: Josh Rosen 
Authored: Thu Jun 9 00:51:24 2016 -0700
Committer: Josh Rosen 
Committed: Thu Jun 9 00:51:24 2016 -0700

--
 dev/test-dependencies.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/921fa40b/dev/test-dependencies.sh
--
diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh
index 5ea643e..28e3d4d 100755
--- a/dev/test-dependencies.sh
+++ b/dev/test-dependencies.sh
@@ -79,7 +79,7 @@ for HADOOP_PROFILE in "${HADOOP_PROFILES[@]}"; do
   echo "Generating dependency manifest for $HADOOP_PROFILE"
   mkdir -p dev/pr-deps
   $MVN $HADOOP2_MODULE_PROFILES -P$HADOOP_PROFILE dependency:build-classpath 
-pl assembly \
-| grep "Building Spark Project Assembly" -A 5 \
+| grep "Dependencies classpath:" -A 1 \
 | tail -n 1 | tr ":" "\n" | rev | cut -d "/" -f 1 | rev | sort \
 | grep -v spark > dev/pr-deps/spark-deps-$HADOOP_PROFILE
 done





spark git commit: [SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 cache

2016-06-09 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/branch-1.6 5830828ef -> bb917fc65


[SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 
cache

This patch fixes a bug in `./dev/test-dependencies.sh` that caused spurious
failures when the script was run on a machine with an empty `.m2` cache. The
problem was that extra log output from the dependency download conflicted with
the grep pattern used to identify the classpath in the Maven output; the fix
adjusts that pattern to key off the "Dependencies classpath:" marker instead.

Tested manually with the following reproduction of the bug:

```
rm -rf ~/.m2/repository/org/apache/commons/
./dev/test-dependencies.sh
```

Author: Josh Rosen 

Closes #13568 from JoshRosen/SPARK-12712.

(cherry picked from commit 921fa40b14082bfd1094fa49fb3b0c46a79c1aaa)
Signed-off-by: Josh Rosen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bb917fc6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bb917fc6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bb917fc6

Branch: refs/heads/branch-1.6
Commit: bb917fc659ec62718214f2f2fceb03a90515ac3e
Parents: 5830828
Author: Josh Rosen 
Authored: Thu Jun 9 00:51:24 2016 -0700
Committer: Josh Rosen 
Committed: Thu Jun 9 00:52:43 2016 -0700

--
 dev/test-dependencies.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bb917fc6/dev/test-dependencies.sh
--
diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh
index efb49f7..7367a8b 100755
--- a/dev/test-dependencies.sh
+++ b/dev/test-dependencies.sh
@@ -84,7 +84,7 @@ for HADOOP_PROFILE in "${HADOOP_PROFILES[@]}"; do
   echo "Generating dependency manifest for $HADOOP_PROFILE"
   mkdir -p dev/pr-deps
   $MVN $HADOOP_MODULE_PROFILES -P$HADOOP_PROFILE dependency:build-classpath 
-pl assembly \
-| grep "Building Spark Project Assembly" -A 5 \
+| grep "Dependencies classpath:" -A 1 \
 | tail -n 1 | tr ":" "\n" | rev | cut -d "/" -f 1 | rev | sort \
 | grep -v spark > dev/pr-deps/spark-deps-$HADOOP_PROFILE
 done





spark git commit: [SPARK-15818][BUILD] Upgrade to Hadoop 2.7.2

2016-06-09 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 921fa40b1 -> 147c02082


[SPARK-15818][BUILD] Upgrade to Hadoop 2.7.2

## What changes were proposed in this pull request?

Update the Hadoop version from 2.7.0 to 2.7.2 when the hadoop-2.7 build profile
is used.

## How was this patch tested?

Existing tests

I'd like us to use Hadoop 2.7.2, since the Hadoop release notes state that
Hadoop 2.7.0 is not ready for production use.

https://hadoop.apache.org/docs/r2.7.0/ states

"Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building 
upon the previous stable release 2.6.0.
This release is not yet ready for production use. Production users should use 
2.7.1 release and beyond."

Hadoop 2.7.1 release notes:
"Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building 
upon the previous release 2.7.0. This is the next stable release after Apache 
Hadoop 2.6.x."

And then Hadoop 2.7.2 release notes:
"Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building 
upon the previous stable release 2.7.1."

I've tested that this works on Intel hardware with IBM Java 8, so let's test it
with OpenJDK; ideally this will be pushed to branch-2.0 and master.

Author: Adam Roberts 

Closes #13556 from a-roberts/patch-2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/147c0208
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/147c0208
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/147c0208

Branch: refs/heads/master
Commit: 147c020823080c60b495f7950629d8134bf895db
Parents: 921fa40
Author: Adam Roberts 
Authored: Thu Jun 9 10:34:01 2016 +0100
Committer: Sean Owen 
Committed: Thu Jun 9 10:34:01 2016 +0100

--
 dev/deps/spark-deps-hadoop-2.4 | 30 +++---
 dev/deps/spark-deps-hadoop-2.6 | 30 +++---
 dev/deps/spark-deps-hadoop-2.7 | 30 +++---
 pom.xml|  6 +++---
 4 files changed, 48 insertions(+), 48 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/147c0208/dev/deps/spark-deps-hadoop-2.4
--
diff --git a/dev/deps/spark-deps-hadoop-2.4 b/dev/deps/spark-deps-hadoop-2.4
index f0491ec..501bf58 100644
--- a/dev/deps/spark-deps-hadoop-2.4
+++ b/dev/deps/spark-deps-hadoop-2.4
@@ -53,21 +53,21 @@ eigenbase-properties-1.1.5.jar
 guava-14.0.1.jar
 guice-3.0.jar
 guice-servlet-3.0.jar
-hadoop-annotations-2.4.0.jar
-hadoop-auth-2.4.0.jar
-hadoop-client-2.4.0.jar
-hadoop-common-2.4.0.jar
-hadoop-hdfs-2.4.0.jar
-hadoop-mapreduce-client-app-2.4.0.jar
-hadoop-mapreduce-client-common-2.4.0.jar
-hadoop-mapreduce-client-core-2.4.0.jar
-hadoop-mapreduce-client-jobclient-2.4.0.jar
-hadoop-mapreduce-client-shuffle-2.4.0.jar
-hadoop-yarn-api-2.4.0.jar
-hadoop-yarn-client-2.4.0.jar
-hadoop-yarn-common-2.4.0.jar
-hadoop-yarn-server-common-2.4.0.jar
-hadoop-yarn-server-web-proxy-2.4.0.jar
+hadoop-annotations-2.4.1.jar
+hadoop-auth-2.4.1.jar
+hadoop-client-2.4.1.jar
+hadoop-common-2.4.1.jar
+hadoop-hdfs-2.4.1.jar
+hadoop-mapreduce-client-app-2.4.1.jar
+hadoop-mapreduce-client-common-2.4.1.jar
+hadoop-mapreduce-client-core-2.4.1.jar
+hadoop-mapreduce-client-jobclient-2.4.1.jar
+hadoop-mapreduce-client-shuffle-2.4.1.jar
+hadoop-yarn-api-2.4.1.jar
+hadoop-yarn-client-2.4.1.jar
+hadoop-yarn-common-2.4.1.jar
+hadoop-yarn-server-common-2.4.1.jar
+hadoop-yarn-server-web-proxy-2.4.1.jar
 hk2-api-2.4.0-b34.jar
 hk2-locator-2.4.0-b34.jar
 hk2-utils-2.4.0-b34.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/147c0208/dev/deps/spark-deps-hadoop-2.6
--
diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6
index b3dced6..b915727 100644
--- a/dev/deps/spark-deps-hadoop-2.6
+++ b/dev/deps/spark-deps-hadoop-2.6
@@ -58,21 +58,21 @@ gson-2.2.4.jar
 guava-14.0.1.jar
 guice-3.0.jar
 guice-servlet-3.0.jar
-hadoop-annotations-2.6.0.jar
-hadoop-auth-2.6.0.jar
-hadoop-client-2.6.0.jar
-hadoop-common-2.6.0.jar
-hadoop-hdfs-2.6.0.jar
-hadoop-mapreduce-client-app-2.6.0.jar
-hadoop-mapreduce-client-common-2.6.0.jar
-hadoop-mapreduce-client-core-2.6.0.jar
-hadoop-mapreduce-client-jobclient-2.6.0.jar
-hadoop-mapreduce-client-shuffle-2.6.0.jar
-hadoop-yarn-api-2.6.0.jar
-hadoop-yarn-client-2.6.0.jar
-hadoop-yarn-common-2.6.0.jar
-hadoop-yarn-server-common-2.6.0.jar
-hadoop-yarn-server-web-proxy-2.6.0.jar
+hadoop-annotations-2.6.4.jar
+hadoop-auth-2.6.4.jar
+hadoop-client-2.6.4.jar
+hadoop-common-2.6.4.jar
+hadoop-hdfs-2.6.4.jar
+hadoop-mapreduce-client-app-2.6.4.jar
+h

spark git commit: [SPARK-15818][BUILD] Upgrade to Hadoop 2.7.2

2016-06-09 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 8ee93eed9 -> 77c08d224


[SPARK-15818][BUILD] Upgrade to Hadoop 2.7.2

## What changes were proposed in this pull request?

Update the Hadoop version from 2.7.0 to 2.7.2 when the hadoop-2.7 build profile
is used.

## How was this patch tested?

Existing tests

I'd like us to use Hadoop 2.7.2, since the Hadoop release notes state that
Hadoop 2.7.0 is not ready for production use.

https://hadoop.apache.org/docs/r2.7.0/ states

"Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building 
upon the previous stable release 2.6.0.
This release is not yet ready for production use. Production users should use 
2.7.1 release and beyond."

Hadoop 2.7.1 release notes:
"Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building 
upon the previous release 2.7.0. This is the next stable release after Apache 
Hadoop 2.6.x."

And then Hadoop 2.7.2 release notes:
"Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building 
upon the previous stable release 2.7.1."

I've tested that this works on Intel hardware with IBM Java 8, so let's test it
with OpenJDK; ideally this will be pushed to branch-2.0 and master.

Author: Adam Roberts 

Closes #13556 from a-roberts/patch-2.

(cherry picked from commit 147c020823080c60b495f7950629d8134bf895db)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/77c08d22
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/77c08d22
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/77c08d22

Branch: refs/heads/branch-2.0
Commit: 77c08d2240bef7d814fc6e4dd0a53fbdf1e2f795
Parents: 8ee93ee
Author: Adam Roberts 
Authored: Thu Jun 9 10:34:01 2016 +0100
Committer: Sean Owen 
Committed: Thu Jun 9 10:34:15 2016 +0100

--
 dev/deps/spark-deps-hadoop-2.4 | 30 +++---
 dev/deps/spark-deps-hadoop-2.6 | 30 +++---
 dev/deps/spark-deps-hadoop-2.7 | 30 +++---
 pom.xml|  6 +++---
 4 files changed, 48 insertions(+), 48 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/77c08d22/dev/deps/spark-deps-hadoop-2.4
--
diff --git a/dev/deps/spark-deps-hadoop-2.4 b/dev/deps/spark-deps-hadoop-2.4
index 77d5266..3df292e 100644
--- a/dev/deps/spark-deps-hadoop-2.4
+++ b/dev/deps/spark-deps-hadoop-2.4
@@ -53,21 +53,21 @@ eigenbase-properties-1.1.5.jar
 guava-14.0.1.jar
 guice-3.0.jar
 guice-servlet-3.0.jar
-hadoop-annotations-2.4.0.jar
-hadoop-auth-2.4.0.jar
-hadoop-client-2.4.0.jar
-hadoop-common-2.4.0.jar
-hadoop-hdfs-2.4.0.jar
-hadoop-mapreduce-client-app-2.4.0.jar
-hadoop-mapreduce-client-common-2.4.0.jar
-hadoop-mapreduce-client-core-2.4.0.jar
-hadoop-mapreduce-client-jobclient-2.4.0.jar
-hadoop-mapreduce-client-shuffle-2.4.0.jar
-hadoop-yarn-api-2.4.0.jar
-hadoop-yarn-client-2.4.0.jar
-hadoop-yarn-common-2.4.0.jar
-hadoop-yarn-server-common-2.4.0.jar
-hadoop-yarn-server-web-proxy-2.4.0.jar
+hadoop-annotations-2.4.1.jar
+hadoop-auth-2.4.1.jar
+hadoop-client-2.4.1.jar
+hadoop-common-2.4.1.jar
+hadoop-hdfs-2.4.1.jar
+hadoop-mapreduce-client-app-2.4.1.jar
+hadoop-mapreduce-client-common-2.4.1.jar
+hadoop-mapreduce-client-core-2.4.1.jar
+hadoop-mapreduce-client-jobclient-2.4.1.jar
+hadoop-mapreduce-client-shuffle-2.4.1.jar
+hadoop-yarn-api-2.4.1.jar
+hadoop-yarn-client-2.4.1.jar
+hadoop-yarn-common-2.4.1.jar
+hadoop-yarn-server-common-2.4.1.jar
+hadoop-yarn-server-web-proxy-2.4.1.jar
 hk2-api-2.4.0-b34.jar
 hk2-locator-2.4.0-b34.jar
 hk2-utils-2.4.0-b34.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/77c08d22/dev/deps/spark-deps-hadoop-2.6
--
diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6
index 9afe50f..9540f58 100644
--- a/dev/deps/spark-deps-hadoop-2.6
+++ b/dev/deps/spark-deps-hadoop-2.6
@@ -58,21 +58,21 @@ gson-2.2.4.jar
 guava-14.0.1.jar
 guice-3.0.jar
 guice-servlet-3.0.jar
-hadoop-annotations-2.6.0.jar
-hadoop-auth-2.6.0.jar
-hadoop-client-2.6.0.jar
-hadoop-common-2.6.0.jar
-hadoop-hdfs-2.6.0.jar
-hadoop-mapreduce-client-app-2.6.0.jar
-hadoop-mapreduce-client-common-2.6.0.jar
-hadoop-mapreduce-client-core-2.6.0.jar
-hadoop-mapreduce-client-jobclient-2.6.0.jar
-hadoop-mapreduce-client-shuffle-2.6.0.jar
-hadoop-yarn-api-2.6.0.jar
-hadoop-yarn-client-2.6.0.jar
-hadoop-yarn-common-2.6.0.jar
-hadoop-yarn-server-common-2.6.0.jar
-hadoop-yarn-server-web-proxy-2.6.0.jar
+hadoop-annotations-2.6.4.jar
+hadoop-auth-2.6.4.jar
+hadoop-cl

spark git commit: [SPARK-15804][SQL] Include metadata in the toStructType

2016-06-09 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 77c08d224 -> eb9e8fc09


[SPARK-15804][SQL] Include metadata in the toStructType

## What changes were proposed in this pull request?
The helper function `toStructType` in the `AttributeSeq` class doesn't include
the metadata when it builds each `StructField`, which causes the problem
reported in
https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK
when Spark writes a DataFrame with column metadata to the Parquet data source.

The code path: when Spark writes the DataFrame to the Parquet data source
through `InsertIntoHadoopFsRelationCommand`, it builds the `WriteRelation`
container and calls the helper function `toStructType` to create the
`StructType` containing the `StructField`s. The metadata should be included
there; otherwise the user-provided metadata is lost.
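
To make the reported behaviour concrete, here is a minimal Scala sketch of the
metadata round trip (not part of the patch; it assumes a running `SparkSession`
named `spark`, and the output path is only an example):

```
import org.apache.spark.sql.types.MetadataBuilder
import spark.implicits._

// Attach user metadata to column "b", write to Parquet, and read the schema back.
val md = new MetadataBuilder().putString("key", "value").build()
val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b")
  .select($"a", $"b".as("b", md))   // column "b" now carries the metadata

df.write.mode("overwrite").parquet("/tmp/spark-15804-example")

// Before this fix the metadata was dropped on the write path; with the fix the
// round-tripped schema still carries it.
val readBack = spark.read.parquet("/tmp/spark-15804-example")
println(readBack.schema("b").metadata.getString("key"))   // prints "value"
```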

## How was this patch tested?

added test case in ParquetQuerySuite.scala


Author: Kevin Yu 

Closes #13555 from kevinyu98/spark-15804.

(cherry picked from commit 99386fe3989f758844de14b2c28eccfdf8163221)
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eb9e8fc0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eb9e8fc0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eb9e8fc0

Branch: refs/heads/branch-2.0
Commit: eb9e8fc097384dbe0d2cb83ca5b80968e3539c78
Parents: 77c08d2
Author: Kevin Yu 
Authored: Thu Jun 9 09:50:09 2016 -0700
Committer: Wenchen Fan 
Committed: Thu Jun 9 09:50:19 2016 -0700

--
 .../spark/sql/catalyst/expressions/package.scala |  2 +-
 .../datasources/parquet/ParquetQuerySuite.scala  | 15 +++
 2 files changed, 16 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/eb9e8fc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
index 81f5bb4..a6125c6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
@@ -91,7 +91,7 @@ package object expressions  {
   implicit class AttributeSeq(val attrs: Seq[Attribute]) extends Serializable {
 /** Creates a StructType with a schema matching this `Seq[Attribute]`. */
 def toStructType: StructType = {
-  StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable)))
+  StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable, 
a.metadata)))
 }
 
 // It's possible that `attrs` is a linked list, which can lead to bad 
O(n^2) loops when

http://git-wip-us.apache.org/repos/asf/spark/blob/eb9e8fc0/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
index 78b97f6..ea57f71 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
@@ -625,6 +625,21 @@ class ParquetQuerySuite extends QueryTest with ParquetTest 
with SharedSQLContext
   }
 }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b")
+val md = new MetadataBuilder().putString("key", "value").build()
+val dfWithmeta = df.select('a, 'b.as("b", md))
+
+withTempPath { dir =>
+  val path = dir.getCanonicalPath
+  dfWithmeta.write.parquet(path)
+
+  readParquetFile(path) { df =>
+assert(df.schema.last.metadata.getString("key") == "value")
+  }
+}
+  }
 }
 
 object TestingUDT {





spark git commit: [SPARK-15804][SQL] Include metadata in the toStructType

2016-06-09 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 147c02082 -> 99386fe39


[SPARK-15804][SQL] Include metadata in the toStructType

## What changes were proposed in this pull request?
The helper function `toStructType` in the `AttributeSeq` class doesn't include
the metadata when it builds each `StructField`, which causes the problem
reported in
https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK
when Spark writes a DataFrame with column metadata to the Parquet data source.

The code path: when Spark writes the DataFrame to the Parquet data source
through `InsertIntoHadoopFsRelationCommand`, it builds the `WriteRelation`
container and calls the helper function `toStructType` to create the
`StructType` containing the `StructField`s. The metadata should be included
there; otherwise the user-provided metadata is lost.

## How was this patch tested?

added test case in ParquetQuerySuite.scala


Author: Kevin Yu 

Closes #13555 from kevinyu98/spark-15804.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/99386fe3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/99386fe3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/99386fe3

Branch: refs/heads/master
Commit: 99386fe3989f758844de14b2c28eccfdf8163221
Parents: 147c020
Author: Kevin Yu 
Authored: Thu Jun 9 09:50:09 2016 -0700
Committer: Wenchen Fan 
Committed: Thu Jun 9 09:50:09 2016 -0700

--
 .../spark/sql/catalyst/expressions/package.scala |  2 +-
 .../datasources/parquet/ParquetQuerySuite.scala  | 15 +++
 2 files changed, 16 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/99386fe3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
index 81f5bb4..a6125c6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
@@ -91,7 +91,7 @@ package object expressions  {
   implicit class AttributeSeq(val attrs: Seq[Attribute]) extends Serializable {
 /** Creates a StructType with a schema matching this `Seq[Attribute]`. */
 def toStructType: StructType = {
-  StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable)))
+  StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable, 
a.metadata)))
 }
 
 // It's possible that `attrs` is a linked list, which can lead to bad 
O(n^2) loops when

http://git-wip-us.apache.org/repos/asf/spark/blob/99386fe3/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
index 78b97f6..ea57f71 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
@@ -625,6 +625,21 @@ class ParquetQuerySuite extends QueryTest with ParquetTest 
with SharedSQLContext
   }
 }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b")
+val md = new MetadataBuilder().putString("key", "value").build()
+val dfWithmeta = df.select('a, 'b.as("b", md))
+
+withTempPath { dir =>
+  val path = dir.getCanonicalPath
+  dfWithmeta.write.parquet(path)
+
+  readParquetFile(path) { df =>
+assert(df.schema.last.metadata.getString("key") == "value")
+  }
+}
+  }
 }
 
 object TestingUDT {





spark git commit: [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property

2016-06-09 Thread mlnick
Repository: spark
Updated Branches:
  refs/heads/master 99386fe39 -> e594b4928


[SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property

## What changes were proposed in this pull request?

Add an `idf` property to `IDFModel` in PySpark, mirroring the Scala API.
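
For reference, the new property forwards (via `_call_java("idf")`) to the
Scala-side `IDFModel.idf` member; a minimal Scala sketch of the same usage,
assuming a running `SparkSession` named `spark`:

```
import org.apache.spark.ml.feature.IDF
import org.apache.spark.ml.linalg.Vectors

// Same toy term-frequency data as the doctest added in this patch.
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(1.0, 2.0)),
  Tuple1(Vectors.dense(0.0, 1.0)),
  Tuple1(Vectors.dense(3.0, 0.2)))).toDF("tf")

val model = new IDF()
  .setMinDocFreq(3)
  .setInputCol("tf")
  .setOutputCol("idf")
  .fit(df)

println(model.idf)   // the fitted IDF vector, [0.0,0.0] for this toy data
```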

## How was this patch tested?

add unit test

Author: Jeff Zhang 

Closes #13540 from zjffdu/SPARK-15788.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e594b492
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e594b492
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e594b492

Branch: refs/heads/master
Commit: e594b492836988ef3d9487b511368c70169d1ecd
Parents: 99386fe
Author: Jeff Zhang 
Authored: Thu Jun 9 09:54:38 2016 -0700
Committer: Nick Pentreath 
Committed: Thu Jun 9 09:54:38 2016 -0700

--
 python/pyspark/ml/feature.py | 10 ++
 1 file changed, 10 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e594b492/python/pyspark/ml/feature.py
--
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index 1aff2e5..ebe1300 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -585,6 +585,8 @@ class IDF(JavaEstimator, HasInputCol, HasOutputCol, 
JavaMLReadable, JavaMLWritab
 ... (DenseVector([0.0, 1.0]),), (DenseVector([3.0, 0.2]),)], ["tf"])
 >>> idf = IDF(minDocFreq=3, inputCol="tf", outputCol="idf")
 >>> model = idf.fit(df)
+>>> model.idf
+DenseVector([0.0, 0.0])
 >>> model.transform(df).head().idf
 DenseVector([0.0, 0.0])
 >>> 
idf.setParams(outputCol="freqs").fit(df).transform(df).collect()[1].freqs
@@ -658,6 +660,14 @@ class IDFModel(JavaModel, JavaMLReadable, JavaMLWritable):
 .. versionadded:: 1.4.0
 """
 
+@property
+@since("2.0.0")
+def idf(self):
+"""
+Returns the IDF vector.
+"""
+return self._call_java("idf")
+
 
 @inherit_doc
 class MaxAbsScaler(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, 
JavaMLWritable):





spark git commit: [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property

2016-06-09 Thread mlnick
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 eb9e8fc09 -> 10f759947


[SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property

## What changes were proposed in this pull request?

Add an `idf` property to `IDFModel` in PySpark, mirroring the Scala API.

## How was this patch tested?

add unit test

Author: Jeff Zhang 

Closes #13540 from zjffdu/SPARK-15788.

(cherry picked from commit e594b492836988ef3d9487b511368c70169d1ecd)
Signed-off-by: Nick Pentreath 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/10f75994
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/10f75994
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/10f75994

Branch: refs/heads/branch-2.0
Commit: 10f7599471dd0d2b5efb49c5e1664fa24dfee074
Parents: eb9e8fc
Author: Jeff Zhang 
Authored: Thu Jun 9 09:54:38 2016 -0700
Committer: Nick Pentreath 
Committed: Thu Jun 9 09:54:47 2016 -0700

--
 python/pyspark/ml/feature.py | 10 ++
 1 file changed, 10 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/10f75994/python/pyspark/ml/feature.py
--
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index 1aff2e5..ebe1300 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -585,6 +585,8 @@ class IDF(JavaEstimator, HasInputCol, HasOutputCol, 
JavaMLReadable, JavaMLWritab
 ... (DenseVector([0.0, 1.0]),), (DenseVector([3.0, 0.2]),)], ["tf"])
 >>> idf = IDF(minDocFreq=3, inputCol="tf", outputCol="idf")
 >>> model = idf.fit(df)
+>>> model.idf
+DenseVector([0.0, 0.0])
 >>> model.transform(df).head().idf
 DenseVector([0.0, 0.0])
 >>> 
idf.setParams(outputCol="freqs").fit(df).transform(df).collect()[1].freqs
@@ -658,6 +660,14 @@ class IDFModel(JavaModel, JavaMLReadable, JavaMLWritable):
 .. versionadded:: 1.4.0
 """
 
+@property
+@since("2.0.0")
+def idf(self):
+"""
+Returns the IDF vector.
+"""
+return self._call_java("idf")
+
 
 @inherit_doc
 class MaxAbsScaler(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, 
JavaMLWritable):





spark git commit: [SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central

2016-06-09 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/master e594b4928 -> f74b77713


[SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central

Spark's SBT build currently uses a fork of the sbt-pom-reader plugin, but it
depends on that fork via an SBT subproject that is cloned from
https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This
unnecessarily slows down the initial build on fresh machines and is also risky:
the build could break if that GitHub repository ever changes or is deleted.

In order to address these issues, I have published a pre-built binary of our 
forked sbt-pom-reader plugin to Maven Central under the `org.spark-project` 
namespace and have updated Spark's build to use that artifact. This published 
artifact was built from 
https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark, which contains 
the contents of ScrapCodes's branch plus an additional patch to configure the 
build for artifact publication.

/cc srowen ScrapCodes for review.

Author: Josh Rosen 

Closes #13564 from JoshRosen/use-published-fork-of-pom-reader.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f74b7771
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f74b7771
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f74b7771

Branch: refs/heads/master
Commit: f74b77713e17960dddb7459eabfdc19f08f4024b
Parents: e594b49
Author: Josh Rosen 
Authored: Thu Jun 9 11:04:08 2016 -0700
Committer: Josh Rosen 
Committed: Thu Jun 9 11:04:08 2016 -0700

--
 project/plugins.sbt|  9 +
 project/project/SparkPluginBuild.scala | 28 
 2 files changed, 9 insertions(+), 28 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f74b7771/project/plugins.sbt
--
diff --git a/project/plugins.sbt b/project/plugins.sbt
index 4578b56..8bebd7b 100644
--- a/project/plugins.sbt
+++ b/project/plugins.sbt
@@ -21,3 +21,12 @@ libraryDependencies += "org.ow2.asm"  % "asm" % "5.0.3"
 libraryDependencies += "org.ow2.asm"  % "asm-commons" % "5.0.3"
 
 addSbtPlugin("com.simplytyped" % "sbt-antlr4" % "0.7.11")
+
+// Spark uses a custom fork of the sbt-pom-reader plugin which contains a 
patch to fix issues
+// related to test-jar dependencies 
(https://github.com/sbt/sbt-pom-reader/pull/14). The source for
+// this fork is published at 
https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark
+// and corresponds to commit b160317fcb0b9d1009635a7c5aa05d0f3be61936 in that 
repository.
+// In the long run, we should try to merge our patch upstream and switch to an 
upstream version of
+// the plugin; this is tracked at SPARK-14401.
+
+addSbtPlugin("org.spark-project" % "sbt-pom-reader" % "1.0.0-spark")

http://git-wip-us.apache.org/repos/asf/spark/blob/f74b7771/project/project/SparkPluginBuild.scala
--
diff --git a/project/project/SparkPluginBuild.scala 
b/project/project/SparkPluginBuild.scala
deleted file mode 100644
index cbb88dc..000
--- a/project/project/SparkPluginBuild.scala
+++ /dev/null
@@ -1,28 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-import sbt._
-import sbt.Keys._
-
-/**
- * This plugin project is there because we use our custom fork of 
sbt-pom-reader plugin. This is
- * a plugin project so that this gets compiled first and is available on the 
classpath for SBT build.
- */
-object SparkPluginDef extends Build {
-  lazy val root = Project("plugins", file(".")) dependsOn(sbtPomReader)
-  lazy val sbtPomReader = 
uri("https://github.com/ScrapCodes/sbt-pom-reader.git#ignore_artifact_id";)
-}





spark git commit: [SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central

2016-06-09 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 10f759947 -> 07a914c09


[SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central

Spark's SBT build currently uses a fork of the sbt-pom-reader plugin, but it
depends on that fork via an SBT subproject that is cloned from
https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This
unnecessarily slows down the initial build on fresh machines and is also risky:
the build could break if that GitHub repository ever changes or is deleted.

In order to address these issues, I have published a pre-built binary of our 
forked sbt-pom-reader plugin to Maven Central under the `org.spark-project` 
namespace and have updated Spark's build to use that artifact. This published 
artifact was built from 
https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark, which contains 
the contents of ScrapCodes's branch plus an additional patch to configure the 
build for artifact publication.

/cc srowen ScrapCodes for review.

Author: Josh Rosen 

Closes #13564 from JoshRosen/use-published-fork-of-pom-reader.

(cherry picked from commit f74b77713e17960dddb7459eabfdc19f08f4024b)
Signed-off-by: Josh Rosen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07a914c0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07a914c0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07a914c0

Branch: refs/heads/branch-2.0
Commit: 07a914c09767deb67a0088b54ee48929a8567374
Parents: 10f7599
Author: Josh Rosen 
Authored: Thu Jun 9 11:04:08 2016 -0700
Committer: Josh Rosen 
Committed: Thu Jun 9 11:04:27 2016 -0700

--
 project/plugins.sbt|  9 +
 project/project/SparkPluginBuild.scala | 28 
 2 files changed, 9 insertions(+), 28 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/07a914c0/project/plugins.sbt
--
diff --git a/project/plugins.sbt b/project/plugins.sbt
index 4578b56..8bebd7b 100644
--- a/project/plugins.sbt
+++ b/project/plugins.sbt
@@ -21,3 +21,12 @@ libraryDependencies += "org.ow2.asm"  % "asm" % "5.0.3"
 libraryDependencies += "org.ow2.asm"  % "asm-commons" % "5.0.3"
 
 addSbtPlugin("com.simplytyped" % "sbt-antlr4" % "0.7.11")
+
+// Spark uses a custom fork of the sbt-pom-reader plugin which contains a 
patch to fix issues
+// related to test-jar dependencies 
(https://github.com/sbt/sbt-pom-reader/pull/14). The source for
+// this fork is published at 
https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark
+// and corresponds to commit b160317fcb0b9d1009635a7c5aa05d0f3be61936 in that 
repository.
+// In the long run, we should try to merge our patch upstream and switch to an 
upstream version of
+// the plugin; this is tracked at SPARK-14401.
+
+addSbtPlugin("org.spark-project" % "sbt-pom-reader" % "1.0.0-spark")

http://git-wip-us.apache.org/repos/asf/spark/blob/07a914c0/project/project/SparkPluginBuild.scala
--
diff --git a/project/project/SparkPluginBuild.scala 
b/project/project/SparkPluginBuild.scala
deleted file mode 100644
index cbb88dc..000
--- a/project/project/SparkPluginBuild.scala
+++ /dev/null
@@ -1,28 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-import sbt._
-import sbt.Keys._
-
-/**
- * This plugin project is there because we use our custom fork of 
sbt-pom-reader plugin. This is
- * a plugin project so that this gets compiled first and is available on the 
classpath for SBT build.
- */
-object SparkPluginDef extends Build {
-  lazy val root = Project("plugins", file(".")) dependsOn(sbtPomReader)
-  lazy val sbtPomReader = 
uri("https://github.com/ScrapCodes/sbt-pom-reader.git#ignore_artifact_id";)
-}





spark git commit: [SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central

2016-06-09 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/branch-1.6 bb917fc65 -> 739d992f0


[SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central

Spark's SBT build currently uses a fork of the sbt-pom-reader plugin, but it
depends on that fork via an SBT subproject that is cloned from
https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This
unnecessarily slows down the initial build on fresh machines and is also risky:
the build could break if that GitHub repository ever changes or is deleted.

In order to address these issues, I have published a pre-built binary of our 
forked sbt-pom-reader plugin to Maven Central under the `org.spark-project` 
namespace and have updated Spark's build to use that artifact. This published 
artifact was built from 
https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark, which contains 
the contents of ScrapCodes's branch plus an additional patch to configure the 
build for artifact publication.

/cc srowen ScrapCodes for review.

Author: Josh Rosen 

Closes #13564 from JoshRosen/use-published-fork-of-pom-reader.

(cherry picked from commit f74b77713e17960dddb7459eabfdc19f08f4024b)
Signed-off-by: Josh Rosen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/739d992f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/739d992f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/739d992f

Branch: refs/heads/branch-1.6
Commit: 739d992f041b995fbf44b93cf47bced3d3811ad9
Parents: bb917fc
Author: Josh Rosen 
Authored: Thu Jun 9 11:04:08 2016 -0700
Committer: Josh Rosen 
Committed: Thu Jun 9 11:08:22 2016 -0700

--
 project/plugins.sbt|  9 +
 project/project/SparkPluginBuild.scala | 28 
 2 files changed, 9 insertions(+), 28 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/739d992f/project/plugins.sbt
--
diff --git a/project/plugins.sbt b/project/plugins.sbt
index c06687d..6a4e401 100644
--- a/project/plugins.sbt
+++ b/project/plugins.sbt
@@ -32,3 +32,12 @@ addSbtPlugin("io.spray" % "sbt-revolver" % "0.7.2")
 libraryDependencies += "org.ow2.asm"  % "asm" % "5.0.3"
 
 libraryDependencies += "org.ow2.asm"  % "asm-commons" % "5.0.3"
+
+// Spark uses a custom fork of the sbt-pom-reader plugin which contains a 
patch to fix issues
+// related to test-jar dependencies 
(https://github.com/sbt/sbt-pom-reader/pull/14). The source for
+// this fork is published at 
https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark
+// and corresponds to commit b160317fcb0b9d1009635a7c5aa05d0f3be61936 in that 
repository.
+// In the long run, we should try to merge our patch upstream and switch to an 
upstream version of
+// the plugin; this is tracked at SPARK-14401.
+
+addSbtPlugin("org.spark-project" % "sbt-pom-reader" % "1.0.0-spark")

http://git-wip-us.apache.org/repos/asf/spark/blob/739d992f/project/project/SparkPluginBuild.scala
--
diff --git a/project/project/SparkPluginBuild.scala 
b/project/project/SparkPluginBuild.scala
deleted file mode 100644
index cbb88dc..000
--- a/project/project/SparkPluginBuild.scala
+++ /dev/null
@@ -1,28 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-import sbt._
-import sbt.Keys._
-
-/**
- * This plugin project is there because we use our custom fork of 
sbt-pom-reader plugin. This is
- * a plugin project so that this gets compiled first and is available on the 
classpath for SBT build.
- */
-object SparkPluginDef extends Build {
-  lazy val root = Project("plugins", file(".")) dependsOn(sbtPomReader)
-  lazy val sbtPomReader = 
uri("https://github.com/ScrapCodes/sbt-pom-reader.git#ignore_artifact_id";)
-}





spark git commit: [SPARK-15839] Fix Maven doc-jar generation when JAVA_7_HOME is set

2016-06-09 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/master f74b77713 -> 6cb71f473


[SPARK-15839] Fix Maven doc-jar generation when JAVA_7_HOME is set

## What changes were proposed in this pull request?

It looks like the nightly Maven snapshots broke after we set `JAVA_7_HOME` in 
the build: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/1573/.
 It seems that passing `-javabootclasspath` to ScalaDoc using 
scala-maven-plugin ends up preventing the Scala library classes from being 
added to scalac's internal class path, causing compilation errors while 
building doc-jars.

There might be a principled fix to this inside of the scala-maven-plugin 
itself, but for now this patch configures the build to omit the 
`-javabootclasspath` option during Maven doc-jar generation.

## How was this patch tested?

Tested manually with `build/mvn clean install -DskipTests=true` when 
`JAVA_7_HOME` was set. Also manually inspected the effective POM diff to verify 
that the final POM changes were scoped correctly: 
https://gist.github.com/JoshRosen/f889d1c236fad14fa25ac4be01654653

/cc vanzin and yhuai for review.

Author: Josh Rosen 

Closes #13573 from JoshRosen/SPARK-15839.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6cb71f47
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6cb71f47
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6cb71f47

Branch: refs/heads/master
Commit: 6cb71f4733a920d916b91c66bb2a508a21883b16
Parents: f74b777
Author: Josh Rosen 
Authored: Thu Jun 9 12:32:29 2016 -0700
Committer: Yin Huai 
Committed: Thu Jun 9 12:32:29 2016 -0700

--
 pom.xml | 29 +++--
 1 file changed, 23 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6cb71f47/pom.xml
--
diff --git a/pom.xml b/pom.xml
index fd40683..89ed87f 100644
--- a/pom.xml
+++ b/pom.xml
@@ -2605,12 +2605,29 @@
 
             <groupId>net.alchim31.maven</groupId>
             <artifactId>scala-maven-plugin</artifactId>
-            <configuration>
-              <args>
-                <arg>-javabootclasspath</arg>
-                <arg>${env.JAVA_7_HOME}/jre/lib/rt.jar</arg>
-              </args>
-            </configuration>
+            <executions>
+              <execution>
+                <id>scala-compile-first</id>
+                <configuration>
+                  <args>
+                    <arg>-javabootclasspath</arg>
+                    <arg>${env.JAVA_7_HOME}/jre/lib/rt.jar</arg>
+                  </args>
+                </configuration>
+              </execution>
+              <execution>
+                <id>scala-test-compile-first</id>
+                <configuration>
+                  <args>
+                    <arg>-javabootclasspath</arg>
+                    <arg>${env.JAVA_7_HOME}/jre/lib/rt.jar</arg>
+                  </args>
+                </configuration>
+              </execution>
+            </executions>
 
   
 





spark git commit: [SPARK-15839] Fix Maven doc-jar generation when JAVA_7_HOME is set

2016-06-09 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 07a914c09 -> 0408793aa


[SPARK-15839] Fix Maven doc-jar generation when JAVA_7_HOME is set

## What changes were proposed in this pull request?

It looks like the nightly Maven snapshots broke after we set `JAVA_7_HOME` in 
the build: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/1573/.
 It seems that passing `-javabootclasspath` to ScalaDoc using 
scala-maven-plugin ends up preventing the Scala library classes from being 
added to scalac's internal class path, causing compilation errors while 
building doc-jars.

There might be a principled fix to this inside of the scala-maven-plugin 
itself, but for now this patch configures the build to omit the 
`-javabootclasspath` option during Maven doc-jar generation.

## How was this patch tested?

Tested manually with `build/mvn clean install -DskipTests=true` when 
`JAVA_7_HOME` was set. Also manually inspected the effective POM diff to verify 
that the final POM changes were scoped correctly: 
https://gist.github.com/JoshRosen/f889d1c236fad14fa25ac4be01654653

/cc vanzin and yhuai for review.

Author: Josh Rosen 

Closes #13573 from JoshRosen/SPARK-15839.

(cherry picked from commit 6cb71f4733a920d916b91c66bb2a508a21883b16)
Signed-off-by: Yin Huai 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0408793a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0408793a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0408793a

Branch: refs/heads/branch-2.0
Commit: 0408793aaab177920363913c9e91ac3af695370a
Parents: 07a914c
Author: Josh Rosen 
Authored: Thu Jun 9 12:32:29 2016 -0700
Committer: Yin Huai 
Committed: Thu Jun 9 12:33:00 2016 -0700

--
 pom.xml | 29 +++--
 1 file changed, 23 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0408793a/pom.xml
--
diff --git a/pom.xml b/pom.xml
index 0419896..7f9ea44 100644
--- a/pom.xml
+++ b/pom.xml
@@ -2605,12 +2605,29 @@
 
             <groupId>net.alchim31.maven</groupId>
             <artifactId>scala-maven-plugin</artifactId>
-            <configuration>
-              <args>
-                <arg>-javabootclasspath</arg>
-                <arg>${env.JAVA_7_HOME}/jre/lib/rt.jar</arg>
-              </args>
-            </configuration>
+            <executions>
+              <execution>
+                <id>scala-compile-first</id>
+                <configuration>
+                  <args>
+                    <arg>-javabootclasspath</arg>
+                    <arg>${env.JAVA_7_HOME}/jre/lib/rt.jar</arg>
+                  </args>
+                </configuration>
+              </execution>
+              <execution>
+                <id>scala-test-compile-first</id>
+                <configuration>
+                  <args>
+                    <arg>-javabootclasspath</arg>
+                    <arg>${env.JAVA_7_HOME}/jre/lib/rt.jar</arg>
+                  </args>
+                </configuration>
+              </execution>
+            </executions>
 
   
 





spark git commit: [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions

2016-06-09 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 0408793aa -> b42e3d886


[SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date 
functions

## What changes were proposed in this pull request?
The current implementations of `UnixTime` and `FromUnixTime` do not cache their
parser/formatter as much as they could. This PR resolves that.

This PR is a takeover of https://github.com/apache/spark/pull/13522 and further
optimizes the reuse of the parser/formatter. It also improves error handling
(catching the actual exception instead of `Throwable`). All credit for this
work should go to rajeshbalamohan.
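
The caching pattern being applied can be sketched in isolation as follows (a
simplified standalone sketch, not the actual Catalyst expression; the class
name is hypothetical):

```
import java.text.SimpleDateFormat
import scala.util.Try

// When the format string is a constant (foldable) expression, build the
// SimpleDateFormat once per expression instance instead of once per row.
class ConstantFormatUnixTime(constFormat: String) {
  // Built lazily and reused for every row this instance evaluates.
  private lazy val formatter: SimpleDateFormat =
    Try(new SimpleDateFormat(constFormat)).getOrElse(null)

  /** Returns Unix seconds, or null when the format or the value cannot be parsed. */
  def evalSeconds(value: String): Any =
    if (formatter == null) null
    else Try(formatter.parse(value).getTime / 1000L).getOrElse(null)
}
```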

This PR closes https://github.com/apache/spark/pull/13522

## How was this patch tested?
Current tests.

Author: Herman van Hovell 
Author: Rajesh Balamohan 

Closes #13581 from hvanhovell/SPARK-14321.

(cherry picked from commit b0768538e56e5bbda7aaabbe2a0197e30ba5f993)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b42e3d88
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b42e3d88
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b42e3d88

Branch: refs/heads/branch-2.0
Commit: b42e3d886f1688feb6e9ca85a60ce5e6295e8489
Parents: 0408793
Author: Herman van Hovell 
Authored: Thu Jun 9 16:37:18 2016 -0700
Committer: Reynold Xin 
Committed: Thu Jun 9 16:37:29 2016 -0700

--
 .../expressions/datetimeExpressions.scala   | 48 ++--
 1 file changed, 24 insertions(+), 24 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b42e3d88/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
index 69c32f4..773431d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
@@ -399,6 +399,8 @@ abstract class UnixTime extends BinaryExpression with 
ExpectsInputTypes {
   override def nullable: Boolean = true
 
   private lazy val constFormat: UTF8String = 
right.eval().asInstanceOf[UTF8String]
+  private lazy val formatter: SimpleDateFormat =
+Try(new SimpleDateFormat(constFormat.toString)).getOrElse(null)
 
   override def eval(input: InternalRow): Any = {
 val t = left.eval(input)
@@ -411,11 +413,11 @@ abstract class UnixTime extends BinaryExpression with 
ExpectsInputTypes {
 case TimestampType =>
   t.asInstanceOf[Long] / 1000000L
 case StringType if right.foldable =>
-  if (constFormat != null) {
-Try(new SimpleDateFormat(constFormat.toString).parse(
-  t.asInstanceOf[UTF8String].toString).getTime / 
1000L).getOrElse(null)
-  } else {
+  if (constFormat == null || formatter == null) {
 null
+  } else {
+Try(formatter.parse(
+  t.asInstanceOf[UTF8String].toString).getTime / 
1000L).getOrElse(null)
   }
 case StringType =>
   val f = right.eval(input)
@@ -434,13 +436,10 @@ abstract class UnixTime extends BinaryExpression with 
ExpectsInputTypes {
 left.dataType match {
   case StringType if right.foldable =>
 val sdf = classOf[SimpleDateFormat].getName
-val fString = if (constFormat == null) null else constFormat.toString
-val formatter = ctx.freshName("formatter")
-if (fString == null) {
-  ev.copy(code = s"""
-boolean ${ev.isNull} = true;
-${ctx.javaType(dataType)} ${ev.value} = 
${ctx.defaultValue(dataType)};""")
+if (formatter == null) {
+  ExprCode("", "true", ctx.defaultValue(dataType))
 } else {
+  val formatterName = ctx.addReferenceObj("formatter", formatter, sdf)
   val eval1 = left.genCode(ctx)
   ev.copy(code = s"""
 ${eval1.code}
@@ -448,10 +447,8 @@ abstract class UnixTime extends BinaryExpression with 
ExpectsInputTypes {
 ${ctx.javaType(dataType)} ${ev.value} = 
${ctx.defaultValue(dataType)};
 if (!${ev.isNull}) {
   try {
-$sdf $formatter = new $sdf("$fString");
-${ev.value} =
-  $formatter.parse(${eval1.value}.toString()).getTime() / 
1000L;
-  } catch (java.lang.Throwable e) {
+${ev.value} = 
$formatterName.parse(${eval1.value}.toString()).getTime() / 1000L;
+  } catch (java.text.ParseException e) {
 $

spark git commit: [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions

2016-06-09 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 6cb71f473 -> b0768538e


[SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date 
functions

## What changes were proposed in this pull request?
The current implementations of `UnixTime` and `FromUnixTime` do not cache their
parser/formatter as much as they could. This PR resolves that.

This PR is a takeover of https://github.com/apache/spark/pull/13522 and further
optimizes the reuse of the parser/formatter. It also improves error handling
(catching the actual exception instead of `Throwable`). All credit for this
work should go to rajeshbalamohan.

This PR closes https://github.com/apache/spark/pull/13522

## How was this patch tested?
Current tests.

Author: Herman van Hovell 
Author: Rajesh Balamohan 

Closes #13581 from hvanhovell/SPARK-14321.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b0768538
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b0768538
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b0768538

Branch: refs/heads/master
Commit: b0768538e56e5bbda7aaabbe2a0197e30ba5f993
Parents: 6cb71f4
Author: Herman van Hovell 
Authored: Thu Jun 9 16:37:18 2016 -0700
Committer: Reynold Xin 
Committed: Thu Jun 9 16:37:18 2016 -0700

--
 .../expressions/datetimeExpressions.scala   | 48 ++--
 1 file changed, 24 insertions(+), 24 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b0768538/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
index 69c32f4..773431d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
@@ -399,6 +399,8 @@ abstract class UnixTime extends BinaryExpression with 
ExpectsInputTypes {
   override def nullable: Boolean = true
 
   private lazy val constFormat: UTF8String = 
right.eval().asInstanceOf[UTF8String]
+  private lazy val formatter: SimpleDateFormat =
+Try(new SimpleDateFormat(constFormat.toString)).getOrElse(null)
 
   override def eval(input: InternalRow): Any = {
 val t = left.eval(input)
@@ -411,11 +413,11 @@ abstract class UnixTime extends BinaryExpression with 
ExpectsInputTypes {
 case TimestampType =>
   t.asInstanceOf[Long] / 1000000L
 case StringType if right.foldable =>
-  if (constFormat != null) {
-Try(new SimpleDateFormat(constFormat.toString).parse(
-  t.asInstanceOf[UTF8String].toString).getTime / 
1000L).getOrElse(null)
-  } else {
+  if (constFormat == null || formatter == null) {
 null
+  } else {
+Try(formatter.parse(
+  t.asInstanceOf[UTF8String].toString).getTime / 
1000L).getOrElse(null)
   }
 case StringType =>
   val f = right.eval(input)
@@ -434,13 +436,10 @@ abstract class UnixTime extends BinaryExpression with 
ExpectsInputTypes {
 left.dataType match {
   case StringType if right.foldable =>
 val sdf = classOf[SimpleDateFormat].getName
-val fString = if (constFormat == null) null else constFormat.toString
-val formatter = ctx.freshName("formatter")
-if (fString == null) {
-  ev.copy(code = s"""
-boolean ${ev.isNull} = true;
-${ctx.javaType(dataType)} ${ev.value} = 
${ctx.defaultValue(dataType)};""")
+if (formatter == null) {
+  ExprCode("", "true", ctx.defaultValue(dataType))
 } else {
+  val formatterName = ctx.addReferenceObj("formatter", formatter, sdf)
   val eval1 = left.genCode(ctx)
   ev.copy(code = s"""
 ${eval1.code}
@@ -448,10 +447,8 @@ abstract class UnixTime extends BinaryExpression with 
ExpectsInputTypes {
 ${ctx.javaType(dataType)} ${ev.value} = 
${ctx.defaultValue(dataType)};
 if (!${ev.isNull}) {
   try {
-$sdf $formatter = new $sdf("$fString");
-${ev.value} =
-  $formatter.parse(${eval1.value}.toString()).getTime() / 
1000L;
-  } catch (java.lang.Throwable e) {
+${ev.value} = 
$formatterName.parse(${eval1.value}.toString()).getTime() / 1000L;
+  } catch (java.text.ParseException e) {
 ${ev.isNull} = true;
   }
 }""")
@@ -463,7 +460,9 @@ abstract class UnixTime extend

spark git commit: [SPARK-12447][YARN] Only update the states when executor is successfully launched

2016-06-09 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/master b0768538e -> aa0364510


[SPARK-12447][YARN] Only update the states when executor is successfully 
launched

The details are described in https://issues.apache.org/jira/browse/SPARK-12447.

vanzin Please help to review, thanks a lot.
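
For illustration only, a minimal sketch of the pattern this change moves to 
(the names below are hypothetical, not the real YarnAllocator API): bookkeeping 
is mutated only after the container launch call succeeds, and failures are 
caught with `NonFatal`, so a failed launch no longer leaves phantom "running" 
executors behind:

```scala
import scala.collection.mutable
import scala.util.control.NonFatal

object LaunchBookkeepingSketch {
  private var numExecutorsRunning = 0
  private val executorIdToHost = mutable.HashMap.empty[String, String]

  def launch(executorId: String, host: String)(startContainer: () => Unit): Unit = {
    try {
      startContainer()          // may throw, e.g. the NodeManager rejects the container
      synchronized {            // update internal state only on success
        numExecutorsRunning += 1
        executorIdToHost(executorId) = host
      }
    } catch {
      case NonFatal(e) =>
        // release / retry logic would go here; no state was updated
        println(s"Failed to launch executor $executorId on $host: ${e.getMessage}")
    }
  }

  def main(args: Array[String]): Unit = {
    launch("1", "node-a")(() => ())                                   // succeeds
    launch("2", "node-b")(() => sys.error("container start failed"))  // fails, state untouched
    println(s"running=$numExecutorsRunning, hosts=$executorIdToHost")
  }
}
```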

Author: jerryshao 

Closes #10412 from jerryshao/SPARK-12447.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aa036451
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aa036451
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aa036451

Branch: refs/heads/master
Commit: aa0364510792c18a0973b6096cd38f611fc1c1a6
Parents: b076853
Author: jerryshao 
Authored: Thu Jun 9 17:31:19 2016 -0700
Committer: Marcelo Vanzin 
Committed: Thu Jun 9 17:31:19 2016 -0700

--
 .../spark/deploy/yarn/ExecutorRunnable.scala|  5 +-
 .../spark/deploy/yarn/YarnAllocator.scala   | 72 
 2 files changed, 47 insertions(+), 30 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/aa036451/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala
--
diff --git 
a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala 
b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala
index fc753b7..3d0e996 100644
--- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala
+++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala
@@ -55,15 +55,14 @@ private[yarn] class ExecutorRunnable(
 executorCores: Int,
 appId: String,
 securityMgr: SecurityManager,
-localResources: Map[String, LocalResource])
-  extends Runnable with Logging {
+localResources: Map[String, LocalResource]) extends Logging {
 
   var rpc: YarnRPC = YarnRPC.create(conf)
   var nmClient: NMClient = _
   val yarnConf: YarnConfiguration = new YarnConfiguration(conf)
   lazy val env = prepareEnvironment(container)
 
-  override def run(): Unit = {
+  def run(): Unit = {
 logInfo("Starting Executor Container")
 nmClient = NMClient.createNMClient()
 nmClient.init(yarnConf)

http://git-wip-us.apache.org/repos/asf/spark/blob/aa036451/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
--
diff --git 
a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala 
b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
index 066c665..b110d82 100644
--- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
+++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
@@ -24,6 +24,7 @@ import java.util.regex.Pattern
 import scala.collection.mutable
 import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet, Queue}
 import scala.collection.JavaConverters._
+import scala.util.control.NonFatal
 
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.yarn.api.records._
@@ -472,41 +473,58 @@ private[yarn] class YarnAllocator(
*/
   private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
 for (container <- containersToUse) {
-  numExecutorsRunning += 1
-  assert(numExecutorsRunning <= targetNumExecutors)
+  executorIdCounter += 1
   val executorHostname = container.getNodeId.getHost
   val containerId = container.getId
-  executorIdCounter += 1
   val executorId = executorIdCounter.toString
-
   assert(container.getResource.getMemory >= resource.getMemory)
-
   logInfo("Launching container %s for on host %s".format(containerId, 
executorHostname))
-  executorIdToContainer(executorId) = container
-  containerIdToExecutorId(container.getId) = executorId
-
-  val containerSet = 
allocatedHostToContainersMap.getOrElseUpdate(executorHostname,
-new HashSet[ContainerId])
-
-  containerSet += containerId
-  allocatedContainerToHostMap.put(containerId, executorHostname)
-
-  val executorRunnable = new ExecutorRunnable(
-container,
-conf,
-sparkConf,
-driverUrl,
-executorId,
-executorHostname,
-executorMemory,
-executorCores,
-appAttemptId.getApplicationId.toString,
-securityMgr,
-localResources)
+
+  def updateInternalState(): Unit = synchronized {
+numExecutorsRunning += 1
+assert(numExecutorsRunning <= targetNumExecutors)
+executorIdToContainer(executorId) = container
+containerIdToExecutorId(container.getId) = executorId
+
+val containerSet = 
allocatedHostToContainersMap.getOrElseUpdate(executorHostname,
+  new HashSet[ContainerId])
+containerSet += containerId
+al

spark git commit: [SPARK-12447][YARN] Only update the states when executor is successfully launched

2016-06-09 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 b42e3d886 -> b2d076c35


[SPARK-12447][YARN] Only update the states when executor is successfully 
launched

The details are described in https://issues.apache.org/jira/browse/SPARK-12447.

vanzin Please help to review, thanks a lot.

Author: jerryshao 

Closes #10412 from jerryshao/SPARK-12447.

(cherry picked from commit aa0364510792c18a0973b6096cd38f611fc1c1a6)
Signed-off-by: Marcelo Vanzin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b2d076c3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b2d076c3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b2d076c3

Branch: refs/heads/branch-2.0
Commit: b2d076c35d801219a25b420dd373383c08859a82
Parents: b42e3d8
Author: jerryshao 
Authored: Thu Jun 9 17:31:19 2016 -0700
Committer: Marcelo Vanzin 
Committed: Thu Jun 9 17:31:41 2016 -0700

--
 .../spark/deploy/yarn/ExecutorRunnable.scala|  5 +-
 .../spark/deploy/yarn/YarnAllocator.scala   | 72 
 2 files changed, 47 insertions(+), 30 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b2d076c3/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala
--
diff --git 
a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala 
b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala
index fc753b7..3d0e996 100644
--- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala
+++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala
@@ -55,15 +55,14 @@ private[yarn] class ExecutorRunnable(
 executorCores: Int,
 appId: String,
 securityMgr: SecurityManager,
-localResources: Map[String, LocalResource])
-  extends Runnable with Logging {
+localResources: Map[String, LocalResource]) extends Logging {
 
   var rpc: YarnRPC = YarnRPC.create(conf)
   var nmClient: NMClient = _
   val yarnConf: YarnConfiguration = new YarnConfiguration(conf)
   lazy val env = prepareEnvironment(container)
 
-  override def run(): Unit = {
+  def run(): Unit = {
 logInfo("Starting Executor Container")
 nmClient = NMClient.createNMClient()
 nmClient.init(yarnConf)

http://git-wip-us.apache.org/repos/asf/spark/blob/b2d076c3/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
--
diff --git 
a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala 
b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
index 066c665..b110d82 100644
--- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
+++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
@@ -24,6 +24,7 @@ import java.util.regex.Pattern
 import scala.collection.mutable
 import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet, Queue}
 import scala.collection.JavaConverters._
+import scala.util.control.NonFatal
 
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.yarn.api.records._
@@ -472,41 +473,58 @@ private[yarn] class YarnAllocator(
*/
   private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
 for (container <- containersToUse) {
-  numExecutorsRunning += 1
-  assert(numExecutorsRunning <= targetNumExecutors)
+  executorIdCounter += 1
   val executorHostname = container.getNodeId.getHost
   val containerId = container.getId
-  executorIdCounter += 1
   val executorId = executorIdCounter.toString
-
   assert(container.getResource.getMemory >= resource.getMemory)
-
   logInfo("Launching container %s for on host %s".format(containerId, 
executorHostname))
-  executorIdToContainer(executorId) = container
-  containerIdToExecutorId(container.getId) = executorId
-
-  val containerSet = 
allocatedHostToContainersMap.getOrElseUpdate(executorHostname,
-new HashSet[ContainerId])
-
-  containerSet += containerId
-  allocatedContainerToHostMap.put(containerId, executorHostname)
-
-  val executorRunnable = new ExecutorRunnable(
-container,
-conf,
-sparkConf,
-driverUrl,
-executorId,
-executorHostname,
-executorMemory,
-executorCores,
-appAttemptId.getApplicationId.toString,
-securityMgr,
-localResources)
+
+  def updateInternalState(): Unit = synchronized {
+numExecutorsRunning += 1
+assert(numExecutorsRunning <= targetNumExecutors)
+executorIdToContainer(executorId) = container
+containerIdToExecutorId(container.getId) = executorId
+
+val containerSet = 
allocatedHostToContainersMap.getOrElseU

spark git commit: [SPARK-15841][Tests] REPLSuite has incorrect env set for a couple of tests.

2016-06-09 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master aa0364510 -> 83070cd1d


[SPARK-15841][Tests] REPLSuite has incorrect env set for a couple of tests.

Description from JIRA.
In ReplSuite, a test that can be covered well in plain local mode should not 
have to start a local-cluster. Similarly, a test is insufficiently exercised if 
it verifies a fix for a problem specific to a distributed run but only executes 
in local mode.

Existing tests.

Author: Prashant Sharma 

Closes #13574 from ScrapCodes/SPARK-15841/repl-suite-fix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/83070cd1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/83070cd1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/83070cd1

Branch: refs/heads/master
Commit: 83070cd1d459101e1189f3b07ea59e22f98e84ce
Parents: aa03645
Author: Prashant Sharma 
Authored: Thu Jun 9 17:45:37 2016 -0700
Committer: Shixiong Zhu 
Committed: Thu Jun 9 17:45:42 2016 -0700

--
 .../src/test/scala/org/apache/spark/repl/ReplSuite.scala | 4 ++--
 .../src/test/scala/org/apache/spark/repl/ReplSuite.scala | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/83070cd1/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala
--
diff --git 
a/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala 
b/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala
index 19f201f..26b8600 100644
--- a/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala
+++ b/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala
@@ -233,7 +233,7 @@ class ReplSuite extends SparkFunSuite {
   }
 
   test("SPARK-1199 two instances of same class don't type check.") {
-val output = runInterpreter("local-cluster[1,1,1024]",
+val output = runInterpreter("local",
   """
 |case class Sum(exp: String, exp2: String)
 |val a = Sum("A", "B")
@@ -305,7 +305,7 @@ class ReplSuite extends SparkFunSuite {
   }
 
   test("SPARK-2632 importing a method from non serializable class and not 
using it.") {
-val output = runInterpreter("local",
+val output = runInterpreter("local-cluster[1,1,1024]",
 """
   |class TestClass() { def testMethod = 3 }
   |val t = new TestClass

http://git-wip-us.apache.org/repos/asf/spark/blob/83070cd1/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
--
diff --git 
a/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala 
b/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
index 48582c1..2444e93 100644
--- a/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
+++ b/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
@@ -276,7 +276,7 @@ class ReplSuite extends SparkFunSuite {
   }
 
   test("SPARK-1199 two instances of same class don't type check.") {
-val output = runInterpreter("local-cluster[1,1,1024]",
+val output = runInterpreter("local",
   """
 |case class Sum(exp: String, exp2: String)
 |val a = Sum("A", "B")
@@ -336,7 +336,7 @@ class ReplSuite extends SparkFunSuite {
   }
 
   test("SPARK-2632 importing a method from non serializable class and not 
using it.") {
-val output = runInterpreter("local",
+val output = runInterpreter("local-cluster[1,1,1024]",
   """
   |class TestClass() { def testMethod = 3 }
   |val t = new TestClass


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-15841][Tests] REPLSuite has incorrect env set for a couple of tests.

2016-06-09 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 b2d076c35 -> 3119d8eef


[SPARK-15841][Tests] REPLSuite has incorrect env set for a couple of tests.

Description from JIRA.
In ReplSuite, a test that can be covered well in plain local mode should not 
have to start a local-cluster. Similarly, a test is insufficiently exercised if 
it verifies a fix for a problem specific to a distributed run but only executes 
in local mode.

Existing tests.

Author: Prashant Sharma 

Closes #13574 from ScrapCodes/SPARK-15841/repl-suite-fix.

(cherry picked from commit 83070cd1d459101e1189f3b07ea59e22f98e84ce)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3119d8ee
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3119d8ee
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3119d8ee

Branch: refs/heads/branch-2.0
Commit: 3119d8eef6dc2d0805c87989746cd79882b9cfea
Parents: b2d076c
Author: Prashant Sharma 
Authored: Thu Jun 9 17:45:37 2016 -0700
Committer: Shixiong Zhu 
Committed: Thu Jun 9 17:46:28 2016 -0700

--
 .../src/test/scala/org/apache/spark/repl/ReplSuite.scala | 4 ++--
 .../src/test/scala/org/apache/spark/repl/ReplSuite.scala | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3119d8ee/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala
--
diff --git 
a/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala 
b/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala
index 19f201f..26b8600 100644
--- a/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala
+++ b/repl/scala-2.10/src/test/scala/org/apache/spark/repl/ReplSuite.scala
@@ -233,7 +233,7 @@ class ReplSuite extends SparkFunSuite {
   }
 
   test("SPARK-1199 two instances of same class don't type check.") {
-val output = runInterpreter("local-cluster[1,1,1024]",
+val output = runInterpreter("local",
   """
 |case class Sum(exp: String, exp2: String)
 |val a = Sum("A", "B")
@@ -305,7 +305,7 @@ class ReplSuite extends SparkFunSuite {
   }
 
   test("SPARK-2632 importing a method from non serializable class and not 
using it.") {
-val output = runInterpreter("local",
+val output = runInterpreter("local-cluster[1,1,1024]",
 """
   |class TestClass() { def testMethod = 3 }
   |val t = new TestClass

http://git-wip-us.apache.org/repos/asf/spark/blob/3119d8ee/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
--
diff --git 
a/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala 
b/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
index 48582c1..2444e93 100644
--- a/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
+++ b/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala
@@ -276,7 +276,7 @@ class ReplSuite extends SparkFunSuite {
   }
 
   test("SPARK-1199 two instances of same class don't type check.") {
-val output = runInterpreter("local-cluster[1,1,1024]",
+val output = runInterpreter("local",
   """
 |case class Sum(exp: String, exp2: String)
 |val a = Sum("A", "B")
@@ -336,7 +336,7 @@ class ReplSuite extends SparkFunSuite {
   }
 
   test("SPARK-2632 importing a method from non serializable class and not 
using it.") {
-val output = runInterpreter("local",
+val output = runInterpreter("local-cluster[1,1,1024]",
   """
   |class TestClass() { def testMethod = 3 }
   |val t = new TestClass


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-15794] Should truncate toString() of very wide plans

2016-06-09 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 3119d8eef -> 00bbf7873


[SPARK-15794] Should truncate toString() of very wide plans

## What changes were proposed in this pull request?

With very wide tables, e.g. thousands of fields, the plan output is unreadable 
and often causes OOMs due to inefficient string processing. This patch truncates 
all struct and operator field lists to a user-configurable threshold to limit 
the performance impact.

It would also be nice to optimize string generation to avoid this sort of 
O(n^2) slowdown entirely (i.e. use StringBuilder everywhere, including 
expressions), but that is probably too large a change for 2.0 at this point, 
and truncation has other usability benefits.
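
A minimal sketch of the truncation idea (the helper name and signature here are 
illustrative, not the exact utility added by this patch; the default of 25 
fields mirrors the patch's DEFAULT_MAX_TO_STRING_FIELDS):

```scala
// Cap the number of fields rendered and note how many were dropped, so
// toString stays proportional to the threshold for very wide schemas.
object TruncatedStringSketch {
  def truncatedString[T](seq: Seq[T], sep: String, maxFields: Int = 25): String =
    if (seq.length > maxFields) {
      val dropped = seq.length - maxFields
      (seq.take(maxFields).map(_.toString) :+ s"... $dropped more fields").mkString(sep)
    } else {
      seq.mkString(sep)
    }

  def main(args: Array[String]): Unit = {
    val fields = (1 to 2000).map(i => s"c$i: int")
    println(truncatedString(fields, ", ", maxFields = 5))
    // c1: int, c2: int, c3: int, c4: int, c5: int, ... 1995 more fields
  }
}
```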

## How was this patch tested?

Added a microbenchmark that covers this case particularly well. I also ran the 
microbenchmark while varying the truncation threshold.

```
numFields = 5
wide shallowly nested struct field r/w:  Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative

2000 wide x 50 rows (write in-mem)2336 / 2558  0.0   
23364.4   0.1X

numFields = 25
wide shallowly nested struct field r/w:  Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative

2000 wide x 50 rows (write in-mem)4237 / 4465  0.0   
42367.9   0.1X

numFields = 100
wide shallowly nested struct field r/w:  Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative

2000 wide x 50 rows (write in-mem)  10458 / 11223  0.0  
104582.0   0.0X

numFields = Infinity
wide shallowly nested struct field r/w:  Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative

[info]   java.lang.OutOfMemoryError: Java heap space
```

Author: Eric Liang 
Author: Eric Liang 

Closes #13537 from ericl/truncated-string.

(cherry picked from commit b914e1930fd5c5f2808f92d4958ec6fbeddf2e30)
Signed-off-by: Josh Rosen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/00bbf787
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/00bbf787
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/00bbf787

Branch: refs/heads/branch-2.0
Commit: 00bbf787340e208cc76230ffd96026c1305f942c
Parents: 3119d8e
Author: Eric Liang 
Authored: Thu Jun 9 18:05:16 2016 -0700
Committer: Josh Rosen 
Committed: Thu Jun 9 18:05:30 2016 -0700

--
 .../scala/org/apache/spark/util/Utils.scala | 47 +++
 .../org/apache/spark/util/UtilsSuite.scala  |  8 
 .../sql/catalyst/expressions/Expression.scala   |  4 +-
 .../spark/sql/catalyst/trees/TreeNode.scala |  6 +--
 .../org/apache/spark/sql/types/StructType.scala |  7 +--
 .../spark/sql/execution/ExistingRDD.scala   | 10 ++--
 .../spark/sql/execution/QueryExecution.scala|  5 +-
 .../execution/aggregate/HashAggregateExec.scala |  7 +--
 .../execution/aggregate/SortAggregateExec.scala |  7 +--
 .../execution/datasources/LogicalRelation.scala |  3 +-
 .../org/apache/spark/sql/execution/limit.scala  |  5 +-
 .../execution/streaming/StreamExecution.scala   |  3 +-
 .../spark/sql/execution/streaming/memory.scala  |  3 +-
 .../benchmark/WideSchemaBenchmark.scala | 49 
 14 files changed, 140 insertions(+), 24 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/00bbf787/core/src/main/scala/org/apache/spark/util/Utils.scala
--
diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala 
b/core/src/main/scala/org/apache/spark/util/Utils.scala
index 1a9dbca..f9d0540 100644
--- a/core/src/main/scala/org/apache/spark/util/Utils.scala
+++ b/core/src/main/scala/org/apache/spark/util/Utils.scala
@@ -26,6 +26,7 @@ import java.nio.charset.StandardCharsets
 import java.nio.file.Files
 import java.util.{Locale, Properties, Random, UUID}
 import java.util.concurrent._
+import java.util.concurrent.atomic.AtomicBoolean
 import javax.net.ssl.HttpsURLConnection
 
 import scala.annotation.tailrec
@@ -78,6 +79,52 @@ private[spark] object Utils extends Logging {
   private val MAX_DIR_CREATION_ATTEMPTS: Int = 10
   @volatile private var localRootDirs: Array[String] = null
 
+  /**
+   * The performance overhead of creating and logging strings for wide schemas 
can be large. To
+   * limit the impact, we bound the number of fields to include by default. 
This can be overriden
+   * by setting the 'spark.debug.maxToStringFields' co

spark git commit: [SPARK-15794] Should truncate toString() of very wide plans

2016-06-09 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/master 83070cd1d -> b914e1930


[SPARK-15794] Should truncate toString() of very wide plans

## What changes were proposed in this pull request?

With very wide tables, e.g. thousands of fields, the plan output is unreadable 
and often causes OOMs due to inefficient string processing. This patch truncates 
all struct and operator field lists to a user-configurable threshold to limit 
the performance impact.

It would also be nice to optimize string generation to avoid this sort of 
O(n^2) slowdown entirely (i.e. use StringBuilder everywhere, including 
expressions), but that is probably too large a change for 2.0 at this point, 
and truncation has other usability benefits.

## How was this patch tested?

Added a microbenchmark that covers this case particularly well. I also ran the 
microbenchmark while varying the truncation threshold.

```
numFields = 5
wide shallowly nested struct field r/w:  Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative

2000 wide x 50 rows (write in-mem)2336 / 2558  0.0   
23364.4   0.1X

numFields = 25
wide shallowly nested struct field r/w:  Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative

2000 wide x 50 rows (write in-mem)4237 / 4465  0.0   
42367.9   0.1X

numFields = 100
wide shallowly nested struct field r/w:  Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative

2000 wide x 50 rows (write in-mem)  10458 / 11223  0.0  
104582.0   0.0X

numFields = Infinity
wide shallowly nested struct field r/w:  Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative

[info]   java.lang.OutOfMemoryError: Java heap space
```

Author: Eric Liang 
Author: Eric Liang 

Closes #13537 from ericl/truncated-string.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b914e193
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b914e193
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b914e193

Branch: refs/heads/master
Commit: b914e1930fd5c5f2808f92d4958ec6fbeddf2e30
Parents: 83070cd
Author: Eric Liang 
Authored: Thu Jun 9 18:05:16 2016 -0700
Committer: Josh Rosen 
Committed: Thu Jun 9 18:05:16 2016 -0700

--
 .../scala/org/apache/spark/util/Utils.scala | 47 +++
 .../org/apache/spark/util/UtilsSuite.scala  |  8 
 .../sql/catalyst/expressions/Expression.scala   |  4 +-
 .../spark/sql/catalyst/trees/TreeNode.scala |  6 +--
 .../org/apache/spark/sql/types/StructType.scala |  7 +--
 .../spark/sql/execution/ExistingRDD.scala   | 10 ++--
 .../spark/sql/execution/QueryExecution.scala|  5 +-
 .../execution/aggregate/HashAggregateExec.scala |  7 +--
 .../execution/aggregate/SortAggregateExec.scala |  7 +--
 .../execution/datasources/LogicalRelation.scala |  3 +-
 .../org/apache/spark/sql/execution/limit.scala  |  5 +-
 .../execution/streaming/StreamExecution.scala   |  3 +-
 .../spark/sql/execution/streaming/memory.scala  |  3 +-
 .../benchmark/WideSchemaBenchmark.scala | 49 
 14 files changed, 140 insertions(+), 24 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b914e193/core/src/main/scala/org/apache/spark/util/Utils.scala
--
diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala 
b/core/src/main/scala/org/apache/spark/util/Utils.scala
index 1a9dbca..f9d0540 100644
--- a/core/src/main/scala/org/apache/spark/util/Utils.scala
+++ b/core/src/main/scala/org/apache/spark/util/Utils.scala
@@ -26,6 +26,7 @@ import java.nio.charset.StandardCharsets
 import java.nio.file.Files
 import java.util.{Locale, Properties, Random, UUID}
 import java.util.concurrent._
+import java.util.concurrent.atomic.AtomicBoolean
 import javax.net.ssl.HttpsURLConnection
 
 import scala.annotation.tailrec
@@ -78,6 +79,52 @@ private[spark] object Utils extends Logging {
   private val MAX_DIR_CREATION_ATTEMPTS: Int = 10
   @volatile private var localRootDirs: Array[String] = null
 
+  /**
+   * The performance overhead of creating and logging strings for wide schemas 
can be large. To
+   * limit the impact, we bound the number of fields to include by default. 
This can be overriden
+   * by setting the 'spark.debug.maxToStringFields' conf in SparkEnv.
+   */
+  val DEFAULT_MAX_TO_STRING_FIELDS = 25
+
+  private def maxNumToStringFields = {

spark git commit: [SPARK-15853][SQL] HDFSMetadataLog.get should close the input stream

2016-06-09 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/master b914e1930 -> 4d9d9cc58


[SPARK-15853][SQL] HDFSMetadataLog.get should close the input stream

## What changes were proposed in this pull request?

This PR closes the input stream created in `HDFSMetadataLog.get`

## How was this patch tested?

Jenkins unit tests.
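
A minimal, self-contained sketch of the resource-safety pattern applied here 
(the helper below is hypothetical, not the HDFSMetadataLog code): the stream is 
closed in a finally block, so a failure while reading or deserializing no 
longer leaks the open handle:

```scala
import java.io.{ByteArrayInputStream, InputStream}

object CloseStreamSketch {
  def readAll(open: () => InputStream): Array[Byte] = {
    val input = open()
    try {
      // stand-in for IOUtils.toByteArray + deserialize in the real code
      Iterator.continually(input.read()).takeWhile(_ != -1).map(_.toByte).toArray
    } finally {
      input.close()  // runs even if reading or deserialization throws
    }
  }

  def main(args: Array[String]): Unit = {
    val bytes = readAll(() => new ByteArrayInputStream("batch-0 metadata".getBytes("UTF-8")))
    println(new String(bytes, "UTF-8"))
  }
}
```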

Author: Shixiong Zhu 

Closes #13583 from zsxwing/leak.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4d9d9cc5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4d9d9cc5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4d9d9cc5

Branch: refs/heads/master
Commit: 4d9d9cc5853c467acdb67915117127915a98d8f8
Parents: b914e19
Author: Shixiong Zhu 
Authored: Thu Jun 9 18:45:19 2016 -0700
Committer: Tathagata Das 
Committed: Thu Jun 9 18:45:19 2016 -0700

--
 .../spark/sql/execution/streaming/HDFSMetadataLog.scala  | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4d9d9cc5/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
index fca3d51..069e41b 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
@@ -175,8 +175,12 @@ class HDFSMetadataLog[T: ClassTag](sparkSession: 
SparkSession, path: String)
 val batchMetadataFile = batchIdToPath(batchId)
 if (fileManager.exists(batchMetadataFile)) {
   val input = fileManager.open(batchMetadataFile)
-  val bytes = IOUtils.toByteArray(input)
-  Some(deserialize(bytes))
+  try {
+val bytes = IOUtils.toByteArray(input)
+Some(deserialize(bytes))
+  } finally {
+input.close()
+  }
 } else {
   logDebug(s"Unable to find batch $batchMetadataFile")
   None


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-15853][SQL] HDFSMetadataLog.get should close the input stream

2016-06-09 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 00bbf7873 -> ca0801120


[SPARK-15853][SQL] HDFSMetadataLog.get should close the input stream

## What changes were proposed in this pull request?

This PR closes the input stream created in `HDFSMetadataLog.get`

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu 

Closes #13583 from zsxwing/leak.

(cherry picked from commit 4d9d9cc5853c467acdb67915117127915a98d8f8)
Signed-off-by: Tathagata Das 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ca080112
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ca080112
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ca080112

Branch: refs/heads/branch-2.0
Commit: ca0801120b3c650603b98b7838e86fee45f8999f
Parents: 00bbf78
Author: Shixiong Zhu 
Authored: Thu Jun 9 18:45:19 2016 -0700
Committer: Tathagata Das 
Committed: Thu Jun 9 18:45:39 2016 -0700

--
 .../spark/sql/execution/streaming/HDFSMetadataLog.scala  | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ca080112/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
index fca3d51..069e41b 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
@@ -175,8 +175,12 @@ class HDFSMetadataLog[T: ClassTag](sparkSession: 
SparkSession, path: String)
 val batchMetadataFile = batchIdToPath(batchId)
 if (fileManager.exists(batchMetadataFile)) {
   val input = fileManager.open(batchMetadataFile)
-  val bytes = IOUtils.toByteArray(input)
-  Some(deserialize(bytes))
+  try {
+val bytes = IOUtils.toByteArray(input)
+Some(deserialize(bytes))
+  } finally {
+input.close()
+  }
 } else {
   logDebug(s"Unable to find batch $batchMetadataFile")
   None


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-15850][SQL] Remove function grouping in SparkSession

2016-06-09 Thread hvanhovell
Repository: spark
Updated Branches:
  refs/heads/master 4d9d9cc58 -> 16df133d7


[SPARK-15850][SQL] Remove function grouping in SparkSession

## What changes were proposed in this pull request?
SparkSession does not have that many functions due to better namespacing, and 
as a result we probably don't need the function grouping. This patch removes 
the grouping and also adds missing scaladocs for createDataset functions in 
SQLContext.

Closes #13577.

## How was this patch tested?
N/A - this is a documentation change.

Author: Reynold Xin 

Closes #13582 from rxin/SPARK-15850.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/16df133d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/16df133d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/16df133d

Branch: refs/heads/master
Commit: 16df133d7f5f3115cd5baa696fa73a4694f9cba9
Parents: 4d9d9cc
Author: Reynold Xin 
Authored: Thu Jun 9 18:58:24 2016 -0700
Committer: Herman van Hovell 
Committed: Thu Jun 9 18:58:24 2016 -0700

--
 .../scala/org/apache/spark/sql/SQLContext.scala | 62 +++-
 .../org/apache/spark/sql/SparkSession.scala | 28 -
 .../scala/org/apache/spark/sql/functions.scala  |  2 +-
 3 files changed, 61 insertions(+), 31 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/16df133d/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
index 0fb2400..23f2b6e 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
@@ -51,7 +51,7 @@ import org.apache.spark.sql.util.ExecutionListenerManager
  * @groupname specificdata Specific Data Sources
  * @groupname config Configuration
  * @groupname dataframes Custom DataFrame Creation
- * @groupname dataset Custom DataFrame Creation
+ * @groupname dataset Custom Dataset Creation
  * @groupname Ungrouped Support functions for language integrated queries
  * @since 1.0.0
  */
@@ -346,15 +346,73 @@ class SQLContext private[sql](val sparkSession: 
SparkSession)
 sparkSession.createDataFrame(rowRDD, schema, needsConversion)
   }
 
-
+  /**
+   * :: Experimental ::
+   * Creates a [[Dataset]] from a local Seq of data of a given type. This 
method requires an
+   * encoder (to convert a JVM object of type `T` to and from the internal 
Spark SQL representation)
+   * that is generally created automatically through implicits from a 
`SparkSession`, or can be
+   * created explicitly by calling static methods on [[Encoders]].
+   *
+   * == Example ==
+   *
+   * {{{
+   *
+   *   import spark.implicits._
+   *   case class Person(name: String, age: Long)
+   *   val data = Seq(Person("Michael", 29), Person("Andy", 30), 
Person("Justin", 19))
+   *   val ds = spark.createDataset(data)
+   *
+   *   ds.show()
+   *   // +---+---+
+   *   // |   name|age|
+   *   // +---+---+
+   *   // |Michael| 29|
+   *   // |   Andy| 30|
+   *   // | Justin| 19|
+   *   // +---+---+
+   * }}}
+   *
+   * @since 2.0.0
+   * @group dataset
+   */
+  @Experimental
   def createDataset[T : Encoder](data: Seq[T]): Dataset[T] = {
 sparkSession.createDataset(data)
   }
 
+  /**
+   * :: Experimental ::
+   * Creates a [[Dataset]] from an RDD of a given type. This method requires an
+   * encoder (to convert a JVM object of type `T` to and from the internal 
Spark SQL representation)
+   * that is generally created automatically through implicits from a 
`SparkSession`, or can be
+   * created explicitly by calling static methods on [[Encoders]].
+   *
+   * @since 2.0.0
+   * @group dataset
+   */
+  @Experimental
   def createDataset[T : Encoder](data: RDD[T]): Dataset[T] = {
 sparkSession.createDataset(data)
   }
 
+  /**
+   * :: Experimental ::
+   * Creates a [[Dataset]] from a [[java.util.List]] of a given type. This 
method requires an
+   * encoder (to convert a JVM object of type `T` to and from the internal 
Spark SQL representation)
+   * that is generally created automatically through implicits from a 
`SparkSession`, or can be
+   * created explicitly by calling static methods on [[Encoders]].
+   *
+   * == Java Example ==
+   *
+   * {{{
+   * List data = Arrays.asList("hello", "world");
+   * Dataset ds = spark.createDataset(data, Encoders.STRING());
+   * }}}
+   *
+   * @since 2.0.0
+   * @group dataset
+   */
+  @Experimental
   def createDataset[T : Encoder](data: java.util.List[T]): Dataset[T] = {
 sparkSession.createDataset(data)
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/16df133d/sql/core/src/main/scala/org/apache/

spark git commit: [SPARK-15850][SQL] Remove function grouping in SparkSession

2016-06-09 Thread hvanhovell
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 ca0801120 -> d45aa50fc


[SPARK-15850][SQL] Remove function grouping in SparkSession

## What changes were proposed in this pull request?
SparkSession does not have that many functions due to better namespacing, and 
as a result we probably don't need the function grouping. This patch removes 
the grouping and also adds missing scaladocs for createDataset functions in 
SQLContext.

Closes #13577.

## How was this patch tested?
N/A - this is a documentation change.

Author: Reynold Xin 

Closes #13582 from rxin/SPARK-15850.

(cherry picked from commit 16df133d7f5f3115cd5baa696fa73a4694f9cba9)
Signed-off-by: Herman van Hovell 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d45aa50f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d45aa50f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d45aa50f

Branch: refs/heads/branch-2.0
Commit: d45aa50fc7e41f2a8a659b2f512a846342a854dd
Parents: ca08011
Author: Reynold Xin 
Authored: Thu Jun 9 18:58:24 2016 -0700
Committer: Herman van Hovell 
Committed: Thu Jun 9 18:58:37 2016 -0700

--
 .../scala/org/apache/spark/sql/SQLContext.scala | 62 +++-
 .../org/apache/spark/sql/SparkSession.scala | 28 -
 .../scala/org/apache/spark/sql/functions.scala  |  2 +-
 3 files changed, 61 insertions(+), 31 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d45aa50f/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
index 0fb2400..23f2b6e 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
@@ -51,7 +51,7 @@ import org.apache.spark.sql.util.ExecutionListenerManager
  * @groupname specificdata Specific Data Sources
  * @groupname config Configuration
  * @groupname dataframes Custom DataFrame Creation
- * @groupname dataset Custom DataFrame Creation
+ * @groupname dataset Custom Dataset Creation
  * @groupname Ungrouped Support functions for language integrated queries
  * @since 1.0.0
  */
@@ -346,15 +346,73 @@ class SQLContext private[sql](val sparkSession: 
SparkSession)
 sparkSession.createDataFrame(rowRDD, schema, needsConversion)
   }
 
-
+  /**
+   * :: Experimental ::
+   * Creates a [[Dataset]] from a local Seq of data of a given type. This 
method requires an
+   * encoder (to convert a JVM object of type `T` to and from the internal 
Spark SQL representation)
+   * that is generally created automatically through implicits from a 
`SparkSession`, or can be
+   * created explicitly by calling static methods on [[Encoders]].
+   *
+   * == Example ==
+   *
+   * {{{
+   *
+   *   import spark.implicits._
+   *   case class Person(name: String, age: Long)
+   *   val data = Seq(Person("Michael", 29), Person("Andy", 30), 
Person("Justin", 19))
+   *   val ds = spark.createDataset(data)
+   *
+   *   ds.show()
+   *   // +---+---+
+   *   // |   name|age|
+   *   // +---+---+
+   *   // |Michael| 29|
+   *   // |   Andy| 30|
+   *   // | Justin| 19|
+   *   // +---+---+
+   * }}}
+   *
+   * @since 2.0.0
+   * @group dataset
+   */
+  @Experimental
   def createDataset[T : Encoder](data: Seq[T]): Dataset[T] = {
 sparkSession.createDataset(data)
   }
 
+  /**
+   * :: Experimental ::
+   * Creates a [[Dataset]] from an RDD of a given type. This method requires an
+   * encoder (to convert a JVM object of type `T` to and from the internal 
Spark SQL representation)
+   * that is generally created automatically through implicits from a 
`SparkSession`, or can be
+   * created explicitly by calling static methods on [[Encoders]].
+   *
+   * @since 2.0.0
+   * @group dataset
+   */
+  @Experimental
   def createDataset[T : Encoder](data: RDD[T]): Dataset[T] = {
 sparkSession.createDataset(data)
   }
 
+  /**
+   * :: Experimental ::
+   * Creates a [[Dataset]] from a [[java.util.List]] of a given type. This 
method requires an
+   * encoder (to convert a JVM object of type `T` to and from the internal 
Spark SQL representation)
+   * that is generally created automatically through implicits from a 
`SparkSession`, or can be
+   * created explicitly by calling static methods on [[Encoders]].
+   *
+   * == Java Example ==
+   *
+   * {{{
+   * List data = Arrays.asList("hello", "world");
+   * Dataset ds = spark.createDataset(data, Encoders.STRING());
+   * }}}
+   *
+   * @since 2.0.0
+   * @group dataset
+   */
+  @Experimental
   def createDataset[T : Encoder](data: java.util.List[T]): Dataset[T] = {
 sparkSession.createDa

spark git commit: [SPARK-15791] Fix NPE in ScalarSubquery

2016-06-09 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 d45aa50fc -> ebbbf2136


[SPARK-15791] Fix NPE in ScalarSubquery

## What changes were proposed in this pull request?

The fix is pretty simple: just don't make the executedPlan transient in 
`ScalarSubquery`, since it is referenced at execution time.
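
A minimal sketch (not Spark code; the case class below is invented for 
illustration) of why a `@transient` field referenced at execution time leads to 
an NPE: after the object is serialized and shipped to an executor, the 
transient field comes back as null:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// A field marked @transient is dropped during Java serialization, so code that
// dereferences it after the round trip sees null.
case class Holder(@transient plan: String, exprId: Int) {
  def evaluate(): Int = plan.length + exprId   // NPE here once plan is null
}

object TransientNpeSketch {
  private def roundTrip[T](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    in.readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val local = Holder("Project [a#1]", 42)
    println(local.evaluate())        // fine before serialization
    val remote = roundTrip(local)    // simulates shipping to an executor
    println(remote.plan)             // null: the transient field was dropped
    // remote.evaluate() would now throw NullPointerException -- the SPARK-15791 symptom
  }
}
```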

## How was this patch tested?

I verified the fix manually in non-local mode. It's not clear to me why the 
problem did not manifest in local mode; any suggestions?

cc davies

Author: Eric Liang 

Closes #13569 from ericl/fix-scalar-npe.

(cherry picked from commit 6c5fd977fbcb821a57cb4a13bc3d413a695fbc32)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ebbbf213
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ebbbf213
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ebbbf213

Branch: refs/heads/branch-2.0
Commit: ebbbf2136412c628421b88a2bc83091a2b473c55
Parents: d45aa50
Author: Eric Liang 
Authored: Thu Jun 9 22:28:31 2016 -0700
Committer: Reynold Xin 
Committed: Thu Jun 9 22:28:36 2016 -0700

--
 .../scala/org/apache/spark/sql/execution/subquery.scala   |  2 +-
 .../src/test/scala/org/apache/spark/sql/QueryTest.scala   | 10 --
 .../test/scala/org/apache/spark/sql/SQLQuerySuite.scala   |  3 ++-
 .../test/scala/org/apache/spark/sql/SubquerySuite.scala   |  4 
 4 files changed, 15 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ebbbf213/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
index 4a1f12d..461d301 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
@@ -32,7 +32,7 @@ import org.apache.spark.sql.types.DataType
  * This is the physical copy of ScalarSubquery to be used inside SparkPlan.
  */
 case class ScalarSubquery(
-@transient executedPlan: SparkPlan,
+executedPlan: SparkPlan,
 exprId: ExprId)
   extends SubqueryExpression {
 

http://git-wip-us.apache.org/repos/asf/spark/blob/ebbbf213/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala
index 9c044f4..acb59d4 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala
@@ -341,10 +341,16 @@ object QueryTest {
*
* @param df the [[DataFrame]] to be executed
* @param expectedAnswer the expected result in a [[Seq]] of [[Row]]s.
+   * @param checkToRDD whether to verify deserialization to an RDD. This runs 
the query twice.
*/
-  def checkAnswer(df: DataFrame, expectedAnswer: Seq[Row]): Option[String] = {
+  def checkAnswer(
+  df: DataFrame,
+  expectedAnswer: Seq[Row],
+  checkToRDD: Boolean = true): Option[String] = {
 val isSorted = df.logicalPlan.collect { case s: logical.Sort => s 
}.nonEmpty
-
+if (checkToRDD) {
+  df.rdd.count()  // Also attempt to deserialize as an RDD [SPARK-15791]
+}
 
 val sparkAnswer = try df.collect().toSeq catch {
   case e: Exception =>

http://git-wip-us.apache.org/repos/asf/spark/blob/ebbbf213/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
index 8284e8d..90465b6 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -2118,7 +2118,8 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
   // is correct.
   def verifyCallCount(df: DataFrame, expectedResult: Row, expectedCount: 
Int): Unit = {
 countAcc.setValue(0)
-checkAnswer(df, expectedResult)
+QueryTest.checkAnswer(
+  df, Seq(expectedResult), checkToRDD = false /* avoid duplicate exec 
*/)
 assert(countAcc.value == expectedCount)
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/ebbbf213/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scal

spark git commit: [SPARK-15791] Fix NPE in ScalarSubquery

2016-06-09 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 16df133d7 -> 6c5fd977f


[SPARK-15791] Fix NPE in ScalarSubquery

## What changes were proposed in this pull request?

The fix is pretty simple: just don't make the executedPlan transient in 
`ScalarSubquery`, since it is referenced at execution time.

## How was this patch tested?

I verified the fix manually in non-local mode. It's not clear to me why the 
problem did not manifest in local mode; any suggestions?

cc davies

Author: Eric Liang 

Closes #13569 from ericl/fix-scalar-npe.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6c5fd977
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6c5fd977
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6c5fd977

Branch: refs/heads/master
Commit: 6c5fd977fbcb821a57cb4a13bc3d413a695fbc32
Parents: 16df133
Author: Eric Liang 
Authored: Thu Jun 9 22:28:31 2016 -0700
Committer: Reynold Xin 
Committed: Thu Jun 9 22:28:31 2016 -0700

--
 .../scala/org/apache/spark/sql/execution/subquery.scala   |  2 +-
 .../src/test/scala/org/apache/spark/sql/QueryTest.scala   | 10 --
 .../test/scala/org/apache/spark/sql/SQLQuerySuite.scala   |  3 ++-
 .../test/scala/org/apache/spark/sql/SubquerySuite.scala   |  4 
 4 files changed, 15 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6c5fd977/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
index 4a1f12d..461d301 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
@@ -32,7 +32,7 @@ import org.apache.spark.sql.types.DataType
  * This is the physical copy of ScalarSubquery to be used inside SparkPlan.
  */
 case class ScalarSubquery(
-@transient executedPlan: SparkPlan,
+executedPlan: SparkPlan,
 exprId: ExprId)
   extends SubqueryExpression {
 

http://git-wip-us.apache.org/repos/asf/spark/blob/6c5fd977/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala
index 9c044f4..acb59d4 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala
@@ -341,10 +341,16 @@ object QueryTest {
*
* @param df the [[DataFrame]] to be executed
* @param expectedAnswer the expected result in a [[Seq]] of [[Row]]s.
+   * @param checkToRDD whether to verify deserialization to an RDD. This runs 
the query twice.
*/
-  def checkAnswer(df: DataFrame, expectedAnswer: Seq[Row]): Option[String] = {
+  def checkAnswer(
+  df: DataFrame,
+  expectedAnswer: Seq[Row],
+  checkToRDD: Boolean = true): Option[String] = {
 val isSorted = df.logicalPlan.collect { case s: logical.Sort => s 
}.nonEmpty
-
+if (checkToRDD) {
+  df.rdd.count()  // Also attempt to deserialize as an RDD [SPARK-15791]
+}
 
 val sparkAnswer = try df.collect().toSeq catch {
   case e: Exception =>

http://git-wip-us.apache.org/repos/asf/spark/blob/6c5fd977/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
index 8284e8d..90465b6 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -2118,7 +2118,8 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
   // is correct.
   def verifyCallCount(df: DataFrame, expectedResult: Row, expectedCount: 
Int): Unit = {
 countAcc.setValue(0)
-checkAnswer(df, expectedResult)
+QueryTest.checkAnswer(
+  df, Seq(expectedResult), checkToRDD = false /* avoid duplicate exec 
*/)
 assert(countAcc.value == expectedCount)
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/6c5fd977/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
index a932125..05491a4 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
+++

spark git commit: [SPARK-15696][SQL] Improve `crosstab` to have a consistent column order

2016-06-09 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 ebbbf2136 -> 1371d5ece


[SPARK-15696][SQL] Improve `crosstab` to have a consistent column order

## What changes were proposed in this pull request?

Currently, `crosstab` returns a DataFrame whose columns are in a **random 
order** obtained by just calling `distinct`. Also, the documentation of 
`crosstab` shows the result in sorted order, which differs from the current 
implementation. This PR explicitly constructs the columns in sorted order to 
improve the user experience and to make the implementation match the 
documentation.
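
As a tiny illustration of the core change (assuming string-valued keys; not the 
real StatFunctions code), sorting the distinct values before zipping with 
indices makes the column positions deterministic:

```scala
object CrosstabOrderSketch {
  def main(args: Array[String]): Unit = {
    val pairs = Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))

    // Before: .distinct alone -> column order depends on aggregation order.
    // After:  .distinct.sorted -> "a", "b", "c" always map to columns 0, 1, 2.
    val columnIndex: Map[String, Int] =
      pairs.map(_._2).distinct.sorted.zipWithIndex.toMap

    println(columnIndex)   // Map(a -> 0, b -> 1, c -> 2)
  }
}
```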

**Before**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 
2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+-+---+---+---+
|key_value|  3|  2|  1|
+-+---+---+---+
|2|  1|  0|  2|
|1|  0|  1|  1|
|3|  1|  1|  0|
+-+---+---+---+

scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, 
"c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", 
"value").show()
+-+---+---+---+
|key_value|  c|  a|  b|
+-+---+---+---+
|2|  1|  2|  0|
|1|  0|  1|  1|
|3|  1|  0|  1|
+-+---+---+---+
```

**After**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 
2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+-+---+---+---+
|key_value|  1|  2|  3|
+-+---+---+---+
|2|  2|  0|  1|
|1|  1|  1|  0|
|3|  0|  1|  1|
+-+---+---+---+
scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, 
"c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", 
"value").show()
+-+---+---+---+
|key_value|  a|  b|  c|
+-+---+---+---+
|2|  2|  0|  1|
|1|  1|  1|  0|
|3|  0|  1|  1|
+-+---+---+---+
```

## How was this patch tested?

Pass the Jenkins tests with updated testcases.

Author: Dongjoon Hyun 

Closes #13436 from dongjoon-hyun/SPARK-15696.

(cherry picked from commit 5a3533e779d8e43ce0980203dfd3cbe343cc7d0a)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1371d5ec
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1371d5ec
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1371d5ec

Branch: refs/heads/branch-2.0
Commit: 1371d5ecedb76d138dc9f431e5b40e36a58ed9ca
Parents: ebbbf21
Author: Dongjoon Hyun 
Authored: Thu Jun 9 22:46:51 2016 -0700
Committer: Reynold Xin 
Committed: Thu Jun 9 22:46:58 2016 -0700

--
 .../org/apache/spark/sql/execution/stat/StatFunctions.scala  | 4 ++--
 .../test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java  | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1371d5ec/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
index 9c04061..ea58df7 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
@@ -423,9 +423,9 @@ private[sql] object StatFunctions extends Logging {
 def cleanElement(element: Any): String = {
   if (element == null) "null" else element.toString
 }
-// get the distinct values of column 2, so that we can make them the 
column names
+// get the distinct sorted values of column 2, so that we can make them 
the column names
 val distinctCol2: Map[Any, Int] =
-  counts.map(e => cleanElement(e.get(1))).distinct.zipWithIndex.toMap
+  counts.map(e => 
cleanElement(e.get(1))).distinct.sorted.zipWithIndex.toMap
 val columnSize = distinctCol2.size
 require(columnSize < 1e4, s"The number of distinct values for $col2, can't 
" +
   s"exceed 1e4. Currently $columnSize")

http://git-wip-us.apache.org/repos/asf/spark/blob/1371d5ec/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
--
diff --git 
a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java 
b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
index 1e8f106..0152f3f 100644
--- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
+++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
@@ -246,8 +246,8 @@ public class JavaDataFrameSuite {
 Dataset crosstab = d

spark git commit: [SPARK-15696][SQL] Improve `crosstab` to have a consistent column order

2016-06-09 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 6c5fd977f -> 5a3533e77


[SPARK-15696][SQL] Improve `crosstab` to have a consistent column order

## What changes were proposed in this pull request?

Currently, `crosstab` returns a DataFrame whose columns are in a **random 
order** obtained by just calling `distinct`. Also, the documentation of 
`crosstab` shows the result in sorted order, which differs from the current 
implementation. This PR explicitly constructs the columns in sorted order to 
improve the user experience and to make the implementation match the 
documentation.

**Before**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 
2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+-+---+---+---+
|key_value|  3|  2|  1|
+-+---+---+---+
|2|  1|  0|  2|
|1|  0|  1|  1|
|3|  1|  1|  0|
+-+---+---+---+

scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, 
"c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", 
"value").show()
+-+---+---+---+
|key_value|  c|  a|  b|
+-+---+---+---+
|2|  1|  2|  0|
|1|  0|  1|  1|
|3|  1|  0|  1|
+-+---+---+---+
```

**After**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 
2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+-+---+---+---+
|key_value|  1|  2|  3|
+-+---+---+---+
|2|  2|  0|  1|
|1|  1|  1|  0|
|3|  0|  1|  1|
+-+---+---+---+
scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, 
"c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", 
"value").show()
+-+---+---+---+
|key_value|  a|  b|  c|
+-+---+---+---+
|2|  2|  0|  1|
|1|  1|  1|  0|
|3|  0|  1|  1|
+-+---+---+---+
```

## How was this patch tested?

Pass the Jenkins tests with updated testcases.

Author: Dongjoon Hyun 

Closes #13436 from dongjoon-hyun/SPARK-15696.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5a3533e7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5a3533e7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5a3533e7

Branch: refs/heads/master
Commit: 5a3533e779d8e43ce0980203dfd3cbe343cc7d0a
Parents: 6c5fd97
Author: Dongjoon Hyun 
Authored: Thu Jun 9 22:46:51 2016 -0700
Committer: Reynold Xin 
Committed: Thu Jun 9 22:46:51 2016 -0700

--
 .../org/apache/spark/sql/execution/stat/StatFunctions.scala  | 4 ++--
 .../test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java  | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5a3533e7/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
index 9c04061..ea58df7 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
@@ -423,9 +423,9 @@ private[sql] object StatFunctions extends Logging {
 def cleanElement(element: Any): String = {
   if (element == null) "null" else element.toString
 }
-// get the distinct values of column 2, so that we can make them the 
column names
+// get the distinct sorted values of column 2, so that we can make them 
the column names
 val distinctCol2: Map[Any, Int] =
-  counts.map(e => cleanElement(e.get(1))).distinct.zipWithIndex.toMap
+  counts.map(e => 
cleanElement(e.get(1))).distinct.sorted.zipWithIndex.toMap
 val columnSize = distinctCol2.size
 require(columnSize < 1e4, s"The number of distinct values for $col2, can't 
" +
   s"exceed 1e4. Currently $columnSize")

http://git-wip-us.apache.org/repos/asf/spark/blob/5a3533e7/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
--
diff --git 
a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java 
b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
index 1e8f106..0152f3f 100644
--- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
+++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
@@ -246,8 +246,8 @@ public class JavaDataFrameSuite {
 Dataset crosstab = df.stat().crosstab("a", "b");
 String[] columnNames = crosstab.schema().fieldNames();
 Assert.asser