spark git commit: [SPARKR][DOCS] R code doc cleanup
Repository: spark Updated Branches: refs/heads/branch-2.0 4e193d3da -> 38f3b76bd [SPARKR][DOCS] R code doc cleanup ## What changes were proposed in this pull request? I ran a full pass from A to Z and fixed the obvious duplications, improper grouping etc. There are still more doc issues to be cleaned up. ## How was this patch tested? manual tests Author: Felix Cheung Closes #13798 from felixcheung/rdocseealso. (cherry picked from commit 09f4ceaeb0a99874f774e09d868fdf907ecf256f) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/38f3b76b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/38f3b76b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/38f3b76b Branch: refs/heads/branch-2.0 Commit: 38f3b76bd6b4a3e4d20048beeb92275ebf93c8d8 Parents: 4e193d3 Author: Felix Cheung Authored: Mon Jun 20 23:51:08 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 23:51:20 2016 -0700 -- R/pkg/R/DataFrame.R | 39 ++- R/pkg/R/SQLContext.R | 6 +++--- R/pkg/R/column.R | 6 ++ R/pkg/R/context.R| 5 +++-- R/pkg/R/functions.R | 40 +--- R/pkg/R/generics.R | 44 ++-- R/pkg/R/mllib.R | 6 -- R/pkg/R/sparkR.R | 8 +--- 8 files changed, 70 insertions(+), 84 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/38f3b76b/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index b3f2dd8..a8ade1a 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -463,6 +463,7 @@ setMethod("createOrReplaceTempView", }) #' (Deprecated) Register Temporary Table +#' #' Registers a SparkDataFrame as a Temporary Table in the SQLContext #' @param x A SparkDataFrame #' @param tableName A character vector containing the name of the table @@ -606,10 +607,10 @@ setMethod("unpersist", #' #' The following options for repartition are possible: #' \itemize{ -#' \item{"Option 1"} {Return a new SparkDataFrame partitioned by +#' \item{1.} {Return a new SparkDataFrame partitioned by #' the given columns into `numPartitions`.} -#' \item{"Option 2"} {Return a new SparkDataFrame that has exactly `numPartitions`.} -#' \item{"Option 3"} {Return a new SparkDataFrame partitioned by the given column(s), +#' \item{2.} {Return a new SparkDataFrame that has exactly `numPartitions`.} +#' \item{3.} {Return a new SparkDataFrame partitioned by the given column(s), #' using `spark.sql.shuffle.partitions` as number of partitions.} #'} #' @param x A SparkDataFrame @@ -1053,7 +1054,7 @@ setMethod("limit", dataFrame(res) }) -#' Take the first NUM rows of a SparkDataFrame and return a the results as a data.frame +#' Take the first NUM rows of a SparkDataFrame and return a the results as a R data.frame #' #' @family SparkDataFrame functions #' @rdname take @@ -1076,7 +1077,7 @@ setMethod("take", #' Head #' -#' Return the first NUM rows of a SparkDataFrame as a data.frame. If NUM is NULL, +#' Return the first NUM rows of a SparkDataFrame as a R data.frame. If NUM is NULL, #' then head() returns the first 6 rows in keeping with the current data.frame #' convention in R. #' @@ -1157,7 +1158,6 @@ setMethod("toRDD", #' #' @param x a SparkDataFrame #' @return a GroupedData -#' @seealso GroupedData #' @family SparkDataFrame functions #' @rdname groupBy #' @name groupBy @@ -1242,9 +1242,9 @@ dapplyInternal <- function(x, func, schema) { #' #' @param x A SparkDataFrame #' @param func A function to be applied to each partition of the SparkDataFrame. 
-#' func should have only one parameter, to which a data.frame corresponds +#' func should have only one parameter, to which a R data.frame corresponds #' to each partition will be passed. -#' The output of func should be a data.frame. +#' The output of func should be a R data.frame. #' @param schema The schema of the resulting SparkDataFrame after the function is applied. #' It must match the output of func. #' @family SparkDataFrame functions @@ -1291,9 +1291,9 @@ setMethod("dapply", #' #' @param x A SparkDataFrame #' @param func A function to be applied to each partition of the SparkDataFrame. -#' func should have only one parameter, to which a data.frame corresponds +#' func should have only one parameter, to which a R data.frame corresponds #' to each partition will be passed. -#' The output of func should be a data.frame. +#' The output
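The roxygen text above enumerates three distinct behaviours for `repartition` on a SparkDataFrame. For comparison, a minimal Scala sketch of the analogous `Dataset.repartition` overloads (the app name and column are placeholders; the SparkR generic documented in the diff is what the commit actually touches):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-sketch").getOrCreate()
val df = spark.range(0, 1000).toDF("id")

// 1. Partition by the given column(s) into an explicit number of partitions.
val byColumnAndCount = df.repartition(8, df("id"))

// 2. Produce exactly `numPartitions` partitions.
val byCount = df.repartition(8)

// 3. Partition by the given column(s), using spark.sql.shuffle.partitions
//    as the number of partitions.
val byColumn = df.repartition(df("id"))
```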
spark git commit: [SPARKR][DOCS] R code doc cleanup
Repository: spark Updated Branches: refs/heads/master 41e0ffb19 -> 09f4ceaeb [SPARKR][DOCS] R code doc cleanup ## What changes were proposed in this pull request? I ran a full pass from A to Z and fixed the obvious duplications, improper grouping etc. There are still more doc issues to be cleaned up. ## How was this patch tested? manual tests Author: Felix Cheung Closes #13798 from felixcheung/rdocseealso. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/09f4ceae Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/09f4ceae Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/09f4ceae Branch: refs/heads/master Commit: 09f4ceaeb0a99874f774e09d868fdf907ecf256f Parents: 41e0ffb Author: Felix Cheung Authored: Mon Jun 20 23:51:08 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 23:51:08 2016 -0700 -- R/pkg/R/DataFrame.R | 39 ++- R/pkg/R/SQLContext.R | 6 +++--- R/pkg/R/column.R | 6 ++ R/pkg/R/context.R| 5 +++-- R/pkg/R/functions.R | 40 +--- R/pkg/R/generics.R | 44 ++-- R/pkg/R/mllib.R | 6 -- R/pkg/R/sparkR.R | 8 +--- 8 files changed, 70 insertions(+), 84 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/09f4ceae/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index b3f2dd8..a8ade1a 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -463,6 +463,7 @@ setMethod("createOrReplaceTempView", }) #' (Deprecated) Register Temporary Table +#' #' Registers a SparkDataFrame as a Temporary Table in the SQLContext #' @param x A SparkDataFrame #' @param tableName A character vector containing the name of the table @@ -606,10 +607,10 @@ setMethod("unpersist", #' #' The following options for repartition are possible: #' \itemize{ -#' \item{"Option 1"} {Return a new SparkDataFrame partitioned by +#' \item{1.} {Return a new SparkDataFrame partitioned by #' the given columns into `numPartitions`.} -#' \item{"Option 2"} {Return a new SparkDataFrame that has exactly `numPartitions`.} -#' \item{"Option 3"} {Return a new SparkDataFrame partitioned by the given column(s), +#' \item{2.} {Return a new SparkDataFrame that has exactly `numPartitions`.} +#' \item{3.} {Return a new SparkDataFrame partitioned by the given column(s), #' using `spark.sql.shuffle.partitions` as number of partitions.} #'} #' @param x A SparkDataFrame @@ -1053,7 +1054,7 @@ setMethod("limit", dataFrame(res) }) -#' Take the first NUM rows of a SparkDataFrame and return a the results as a data.frame +#' Take the first NUM rows of a SparkDataFrame and return a the results as a R data.frame #' #' @family SparkDataFrame functions #' @rdname take @@ -1076,7 +1077,7 @@ setMethod("take", #' Head #' -#' Return the first NUM rows of a SparkDataFrame as a data.frame. If NUM is NULL, +#' Return the first NUM rows of a SparkDataFrame as a R data.frame. If NUM is NULL, #' then head() returns the first 6 rows in keeping with the current data.frame #' convention in R. #' @@ -1157,7 +1158,6 @@ setMethod("toRDD", #' #' @param x a SparkDataFrame #' @return a GroupedData -#' @seealso GroupedData #' @family SparkDataFrame functions #' @rdname groupBy #' @name groupBy @@ -1242,9 +1242,9 @@ dapplyInternal <- function(x, func, schema) { #' #' @param x A SparkDataFrame #' @param func A function to be applied to each partition of the SparkDataFrame. 
-#' func should have only one parameter, to which a data.frame corresponds +#' func should have only one parameter, to which a R data.frame corresponds #' to each partition will be passed. -#' The output of func should be a data.frame. +#' The output of func should be a R data.frame. #' @param schema The schema of the resulting SparkDataFrame after the function is applied. #' It must match the output of func. #' @family SparkDataFrame functions @@ -1291,9 +1291,9 @@ setMethod("dapply", #' #' @param x A SparkDataFrame #' @param func A function to be applied to each partition of the SparkDataFrame. -#' func should have only one parameter, to which a data.frame corresponds +#' func should have only one parameter, to which a R data.frame corresponds #' to each partition will be passed. -#' The output of func should be a data.frame. +#' The output of func should be a R data.frame. #' @family SparkDataFrame functions #' @rdname dapplyCollect #' @name dapplyCo
spark git commit: [SPARK-15894][SQL][DOC] Update docs for controlling #partitions
Repository: spark Updated Branches: refs/heads/master 58f6e27dd -> 41e0ffb19 [SPARK-15894][SQL][DOC] Update docs for controlling #partitions ## What changes were proposed in this pull request? Update docs for two parameters `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes ` in Other Configuration Options. ## How was this patch tested? N/A Author: Takeshi YAMAMURO Closes #13797 from maropu/SPARK-15894-2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/41e0ffb1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/41e0ffb1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/41e0ffb1 Branch: refs/heads/master Commit: 41e0ffb19f678e9b1e87f747a5e4e3d44964e39a Parents: 58f6e27 Author: Takeshi YAMAMURO Authored: Tue Jun 21 14:27:16 2016 +0800 Committer: Cheng Lian Committed: Tue Jun 21 14:27:16 2016 +0800 -- docs/sql-programming-guide.md | 17 + 1 file changed, 17 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/41e0ffb1/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index 4206f73..ddf8f70 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -2016,6 +2016,23 @@ that these options will be deprecated in future release as more optimizations ar Property NameDefaultMeaning +spark.sql.files.maxPartitionBytes +134217728 (128 MB) + + The maximum number of bytes to pack into a single partition when reading files. + + + +spark.sql.files.openCostInBytes +4194304 (4 MB) + + The estimated cost to open a file, measured by the number of bytes could be scanned in the same + time. This is used when putting multiple files into a partition. It is better to over estimated, + then the partitions with small files will be faster than partitions with bigger files (which is + scheduled first). + + + spark.sql.autoBroadcastJoinThreshold 10485760 (10 MB) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
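Both options documented above are plain SQL configuration keys, so they can be set at runtime as well as in spark-defaults.conf. A minimal sketch in Scala, assuming illustrative values (the defaults remain 128 MB and 4 MB):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-partition-tuning").getOrCreate()

// Cap the number of bytes packed into a single partition when reading files.
spark.conf.set("spark.sql.files.maxPartitionBytes", (64 * 1024 * 1024).toString)

// Estimated cost, in bytes, of opening a file; a larger value packs more small
// files into each partition.
spark.conf.set("spark.sql.files.openCostInBytes", (8 * 1024 * 1024).toString)
```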
spark git commit: [SPARK-15894][SQL][DOC] Update docs for controlling #partitions
Repository: spark Updated Branches: refs/heads/branch-2.0 dbf7f48b6 -> 4e193d3da [SPARK-15894][SQL][DOC] Update docs for controlling #partitions ## What changes were proposed in this pull request? Update docs for two parameters `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes ` in Other Configuration Options. ## How was this patch tested? N/A Author: Takeshi YAMAMURO Closes #13797 from maropu/SPARK-15894-2. (cherry picked from commit 41e0ffb19f678e9b1e87f747a5e4e3d44964e39a) Signed-off-by: Cheng Lian Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4e193d3d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4e193d3d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4e193d3d Branch: refs/heads/branch-2.0 Commit: 4e193d3daf5bdfb38d7df6da5b7abdd53888ec99 Parents: dbf7f48 Author: Takeshi YAMAMURO Authored: Tue Jun 21 14:27:16 2016 +0800 Committer: Cheng Lian Committed: Tue Jun 21 14:27:31 2016 +0800 -- docs/sql-programming-guide.md | 17 + 1 file changed, 17 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4e193d3d/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index 4206f73..ddf8f70 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -2016,6 +2016,23 @@ that these options will be deprecated in future release as more optimizations ar Property NameDefaultMeaning +spark.sql.files.maxPartitionBytes +134217728 (128 MB) + + The maximum number of bytes to pack into a single partition when reading files. + + + +spark.sql.files.openCostInBytes +4194304 (4 MB) + + The estimated cost to open a file, measured by the number of bytes could be scanned in the same + time. This is used when putting multiple files into a partition. It is better to over estimated, + then the partitions with small files will be faster than partitions with bigger files (which is + scheduled first). + + + spark.sql.autoBroadcastJoinThreshold 10485760 (10 MB) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R
Repository: spark Updated Branches: refs/heads/branch-2.0 4fc4eb943 -> dbf7f48b6 [SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R ## What changes were proposed in this pull request? Update doc as per discussion in PR #13592 ## How was this patch tested? manual shivaram liancheng Author: Felix Cheung Closes #13799 from felixcheung/rsqlprogrammingguide. (cherry picked from commit 58f6e27dd70f476f99ac8204e6b405bced4d6de1) Signed-off-by: Cheng Lian Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dbf7f48b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dbf7f48b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dbf7f48b Branch: refs/heads/branch-2.0 Commit: dbf7f48b6e73f3500b0abe9055ac204a3f756418 Parents: 4fc4eb9 Author: Felix Cheung Authored: Tue Jun 21 13:56:37 2016 +0800 Committer: Cheng Lian Committed: Tue Jun 21 13:57:03 2016 +0800 -- docs/sparkr.md| 2 +- docs/sql-programming-guide.md | 34 -- 2 files changed, 17 insertions(+), 19 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dbf7f48b/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 023bbcd..f018901 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -152,7 +152,7 @@ write.df(people, path="people.parquet", source="parquet", mode="overwrite") ### From Hive tables -You can also create SparkDataFrames from Hive tables. To do this we will need to create a SparkSession with Hive support which can access tables in the Hive MetaStore. Note that Spark should have been built with [Hive support](building-spark.html#building-with-hive-and-jdbc-support) and more details can be found in the [SQL programming guide](sql-programming-guide.html#starting-point-sqlcontext). In SparkR, by default it will attempt to create a SparkSession with Hive support enabled (`enableHiveSupport = TRUE`). +You can also create SparkDataFrames from Hive tables. To do this we will need to create a SparkSession with Hive support which can access tables in the Hive MetaStore. Note that Spark should have been built with [Hive support](building-spark.html#building-with-hive-and-jdbc-support) and more details can be found in the [SQL programming guide](sql-programming-guide.html#starting-point-sparksession). In SparkR, by default it will attempt to create a SparkSession with Hive support enabled (`enableHiveSupport = TRUE`). {% highlight r %} http://git-wip-us.apache.org/repos/asf/spark/blob/dbf7f48b/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index d93f30b..4206f73 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -107,19 +107,17 @@ spark = SparkSession.build \ -Unlike Scala, Java, and Python API, we haven't finished migrating `SQLContext` to `SparkSession` for SparkR yet, so -the entry point into all relational functionality in SparkR is still the -`SQLContext` class in Spark 2.0. To create a basic `SQLContext`, all you need is a `SparkContext`. +The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`: {% highlight r %} -spark <- sparkRSQL.init(sc) +sparkR.session() {% endhighlight %} -Note that when invoked for the first time, `sparkRSQL.init()` initializes a global `SQLContext` singleton instance, and always returns a reference to this instance for successive invocations. 
In this way, users only need to initialize the `SQLContext` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SQLContext` instance around. +Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around. -`SparkSession` (or `SQLContext` for SparkR) in Spark 2.0 provides builtin support for Hive features including the ability to +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. To use these features, you do not need to have an existing Hive setup. @@ -175,7 +173,7 @@ df.show()
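For readers comparing with the other language tabs in the same guide, the Scala entry point that the updated R section now parallels looks roughly like this (a sketch only; the app name and JSON path are placeholders, and `enableHiveSupport()` is optional):

```scala
import org.apache.spark.sql.SparkSession

// Scala counterpart of sparkR.session(): a single SparkSession is the entry point,
// optionally with Hive support enabled.
val spark = SparkSession.builder()
  .appName("spark-session-sketch")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.read.json("examples/src/main/resources/people.json")
df.show()
```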
spark git commit: [SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R
Repository: spark Updated Branches: refs/heads/master 07367533d -> 58f6e27dd [SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R ## What changes were proposed in this pull request? Update doc as per discussion in PR #13592 ## How was this patch tested? manual shivaram liancheng Author: Felix Cheung Closes #13799 from felixcheung/rsqlprogrammingguide. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/58f6e27d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/58f6e27d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/58f6e27d Branch: refs/heads/master Commit: 58f6e27dd70f476f99ac8204e6b405bced4d6de1 Parents: 0736753 Author: Felix Cheung Authored: Tue Jun 21 13:56:37 2016 +0800 Committer: Cheng Lian Committed: Tue Jun 21 13:56:37 2016 +0800 -- docs/sparkr.md| 2 +- docs/sql-programming-guide.md | 34 -- 2 files changed, 17 insertions(+), 19 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/58f6e27d/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 023bbcd..f018901 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -152,7 +152,7 @@ write.df(people, path="people.parquet", source="parquet", mode="overwrite") ### From Hive tables -You can also create SparkDataFrames from Hive tables. To do this we will need to create a SparkSession with Hive support which can access tables in the Hive MetaStore. Note that Spark should have been built with [Hive support](building-spark.html#building-with-hive-and-jdbc-support) and more details can be found in the [SQL programming guide](sql-programming-guide.html#starting-point-sqlcontext). In SparkR, by default it will attempt to create a SparkSession with Hive support enabled (`enableHiveSupport = TRUE`). +You can also create SparkDataFrames from Hive tables. To do this we will need to create a SparkSession with Hive support which can access tables in the Hive MetaStore. Note that Spark should have been built with [Hive support](building-spark.html#building-with-hive-and-jdbc-support) and more details can be found in the [SQL programming guide](sql-programming-guide.html#starting-point-sparksession). In SparkR, by default it will attempt to create a SparkSession with Hive support enabled (`enableHiveSupport = TRUE`). {% highlight r %} http://git-wip-us.apache.org/repos/asf/spark/blob/58f6e27d/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index d93f30b..4206f73 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -107,19 +107,17 @@ spark = SparkSession.build \ -Unlike Scala, Java, and Python API, we haven't finished migrating `SQLContext` to `SparkSession` for SparkR yet, so -the entry point into all relational functionality in SparkR is still the -`SQLContext` class in Spark 2.0. To create a basic `SQLContext`, all you need is a `SparkContext`. +The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`: {% highlight r %} -spark <- sparkRSQL.init(sc) +sparkR.session() {% endhighlight %} -Note that when invoked for the first time, `sparkRSQL.init()` initializes a global `SQLContext` singleton instance, and always returns a reference to this instance for successive invocations. 
In this way, users only need to initialize the `SQLContext` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SQLContext` instance around. +Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around. -`SparkSession` (or `SQLContext` for SparkR) in Spark 2.0 provides builtin support for Hive features including the ability to +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. To use these features, you do not need to have an existing Hive setup. @@ -175,7 +173,7 @@ df.show() -With a `SQLContext`, applications can create DataFrames from an [existing `RDD`](#interoperating-wit
spark git commit: [SPARK-16025][CORE] Document OFF_HEAP storage level in 2.0
Repository: spark Updated Branches: refs/heads/branch-2.0 12f00b6ed -> 4fc4eb943 [SPARK-16025][CORE] Document OFF_HEAP storage level in 2.0 This has changed from 1.6, and now stores memory off-heap using spark's off-heap support instead of in tachyon. Author: Eric Liang Closes #13744 from ericl/spark-16025. (cherry picked from commit 07367533de68817e1e6cf9cf2b056a04dd160c8a) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4fc4eb94 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4fc4eb94 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4fc4eb94 Branch: refs/heads/branch-2.0 Commit: 4fc4eb9434676d6c7be1b0dd8ff1dc67d7d2b308 Parents: 12f00b6 Author: Eric Liang Authored: Mon Jun 20 21:56:44 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:56:49 2016 -0700 -- docs/programming-guide.md | 5 + 1 file changed, 5 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4fc4eb94/docs/programming-guide.md -- diff --git a/docs/programming-guide.md b/docs/programming-guide.md index 97bcb51..3872aec 100644 --- a/docs/programming-guide.md +++ b/docs/programming-guide.md @@ -1220,6 +1220,11 @@ storage levels is: MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. Same as the levels above, but replicate each partition on two cluster nodes. + + OFF_HEAP (experimental) + Similar to MEMORY_ONLY_SER, but store the data in +off-heap memory. This requires off-heap memory to be enabled. + **Note:** *In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
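A hedged sketch of using the newly documented level from Scala; the off-heap size and data are illustrative, but the level only takes effect when off-heap memory is enabled:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// OFF_HEAP requires off-heap memory to be enabled and sized.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "1g")

val spark = SparkSession.builder().config(conf).appName("off-heap-sketch").getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 1000000)
rdd.persist(StorageLevel.OFF_HEAP)
rdd.count()
```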
spark git commit: [SPARK-16025][CORE] Document OFF_HEAP storage level in 2.0
Repository: spark Updated Branches: refs/heads/master 4f7f1c436 -> 07367533d [SPARK-16025][CORE] Document OFF_HEAP storage level in 2.0 This has changed from 1.6, and now stores memory off-heap using spark's off-heap support instead of in tachyon. Author: Eric Liang Closes #13744 from ericl/spark-16025. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07367533 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07367533 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07367533 Branch: refs/heads/master Commit: 07367533de68817e1e6cf9cf2b056a04dd160c8a Parents: 4f7f1c4 Author: Eric Liang Authored: Mon Jun 20 21:56:44 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:56:44 2016 -0700 -- docs/programming-guide.md | 5 + 1 file changed, 5 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/07367533/docs/programming-guide.md -- diff --git a/docs/programming-guide.md b/docs/programming-guide.md index 97bcb51..3872aec 100644 --- a/docs/programming-guide.md +++ b/docs/programming-guide.md @@ -1220,6 +1220,11 @@ storage levels is: MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. Same as the levels above, but replicate each partition on two cluster nodes. + + OFF_HEAP (experimental) + Similar to MEMORY_ONLY_SER, but store the data in +off-heap memory. This requires off-heap memory to be enabled. + **Note:** *In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16044][SQL] input_file_name() returns empty strings in data sources based on NewHadoopRDD
Repository: spark Updated Branches: refs/heads/branch-2.0 9d513b8d2 -> 12f00b6ed [SPARK-16044][SQL] input_file_name() returns empty strings in data sources based on NewHadoopRDD ## What changes were proposed in this pull request? This PR makes `input_file_name()` function return the file paths not empty strings for external data sources based on `NewHadoopRDD`, such as [spark-redshift](https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149) and [spark-xml](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47). The codes with the external data sources below: ```scala df.select(input_file_name).show() ``` will produce - **Before** ``` +-+ |input_file_name()| +-+ | | +-+ ``` - **After** ``` ++ | input_file_name()| ++ |file:/private/var...| ++ ``` ## How was this patch tested? Unit tests in `ColumnExpressionSuite`. Author: hyukjinkwon Closes #13759 from HyukjinKwon/SPARK-16044. (cherry picked from commit 4f7f1c436205630ab77d3758d7210cc1a2f0d04a) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/12f00b6e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/12f00b6e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/12f00b6e Branch: refs/heads/branch-2.0 Commit: 12f00b6edde9b6f97d2450e2cd99edd5e31b9169 Parents: 9d513b8 Author: hyukjinkwon Authored: Mon Jun 20 21:55:34 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:55:40 2016 -0700 -- .../apache/spark/rdd/InputFileNameHolder.scala | 2 +- .../org/apache/spark/rdd/NewHadoopRDD.scala | 7 .../spark/sql/ColumnExpressionSuite.scala | 34 ++-- 3 files changed, 40 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/12f00b6e/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala b/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala index 108e9d2..f40d4c8 100644 --- a/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala +++ b/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala @@ -21,7 +21,7 @@ import org.apache.spark.unsafe.types.UTF8String /** * This holds file names of the current Spark task. This is used in HadoopRDD, - * FileScanRDD and InputFileName function in Spark SQL. + * FileScanRDD, NewHadoopRDD and InputFileName function in Spark SQL. */ private[spark] object InputFileNameHolder { /** http://git-wip-us.apache.org/repos/asf/spark/blob/12f00b6e/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala b/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala index 189dc7b..b086baa 100644 --- a/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala @@ -135,6 +135,12 @@ class NewHadoopRDD[K, V]( val inputMetrics = context.taskMetrics().inputMetrics val existingBytesRead = inputMetrics.bytesRead + // Sets the thread local variable for the file's name + split.serializableHadoopSplit.value match { +case fs: FileSplit => InputFileNameHolder.setInputFileName(fs.getPath.toString) +case _ => InputFileNameHolder.unsetInputFileName() + } + // Find a function that will return the FileSystem bytes read by this thread. 
Do this before // creating RecordReader, because RecordReader's constructor might read some bytes val getBytesReadCallback: Option[() => Long] = split.serializableHadoopSplit.value match { @@ -201,6 +207,7 @@ class NewHadoopRDD[K, V]( private def close() { if (reader != null) { + InputFileNameHolder.unsetInputFileName() // Close the reader and release it. Note: it's very important that we don't close the // reader more than once, since that exposes us to MAPREDUCE-5918 when running against // Hadoop 1.x and older Hadoop 2.x releases. That bug can lead to non-deterministic http://git-wip-us.apache.org/repos/asf/spark/blob/12f00b6e/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.
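For context, a minimal Scala sketch of the function this patch fixes. The built-in JSON source used below already reported file names; the change above extends the same behaviour to sources built on NewHadoopRDD, such as spark-xml and spark-redshift. The path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder().appName("input-file-name-sketch").getOrCreate()

// Any file-based source works; the path is illustrative.
val df = spark.read.json("examples/src/main/resources/people.json")

// Each row reports the file it was read from instead of an empty string.
df.select(input_file_name()).show(false)
```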
spark git commit: [SPARK-16044][SQL] input_file_name() returns empty strings in data sources based on NewHadoopRDD
Repository: spark Updated Branches: refs/heads/master 18a8a9b1f -> 4f7f1c436 [SPARK-16044][SQL] input_file_name() returns empty strings in data sources based on NewHadoopRDD ## What changes were proposed in this pull request? This PR makes `input_file_name()` function return the file paths not empty strings for external data sources based on `NewHadoopRDD`, such as [spark-redshift](https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149) and [spark-xml](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47). The codes with the external data sources below: ```scala df.select(input_file_name).show() ``` will produce - **Before** ``` +-+ |input_file_name()| +-+ | | +-+ ``` - **After** ``` ++ | input_file_name()| ++ |file:/private/var...| ++ ``` ## How was this patch tested? Unit tests in `ColumnExpressionSuite`. Author: hyukjinkwon Closes #13759 from HyukjinKwon/SPARK-16044. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4f7f1c43 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4f7f1c43 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4f7f1c43 Branch: refs/heads/master Commit: 4f7f1c436205630ab77d3758d7210cc1a2f0d04a Parents: 18a8a9b Author: hyukjinkwon Authored: Mon Jun 20 21:55:34 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:55:34 2016 -0700 -- .../apache/spark/rdd/InputFileNameHolder.scala | 2 +- .../org/apache/spark/rdd/NewHadoopRDD.scala | 7 .../spark/sql/ColumnExpressionSuite.scala | 34 ++-- 3 files changed, 40 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4f7f1c43/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala b/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala index 108e9d2..f40d4c8 100644 --- a/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala +++ b/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala @@ -21,7 +21,7 @@ import org.apache.spark.unsafe.types.UTF8String /** * This holds file names of the current Spark task. This is used in HadoopRDD, - * FileScanRDD and InputFileName function in Spark SQL. + * FileScanRDD, NewHadoopRDD and InputFileName function in Spark SQL. */ private[spark] object InputFileNameHolder { /** http://git-wip-us.apache.org/repos/asf/spark/blob/4f7f1c43/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala b/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala index 189dc7b..b086baa 100644 --- a/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala @@ -135,6 +135,12 @@ class NewHadoopRDD[K, V]( val inputMetrics = context.taskMetrics().inputMetrics val existingBytesRead = inputMetrics.bytesRead + // Sets the thread local variable for the file's name + split.serializableHadoopSplit.value match { +case fs: FileSplit => InputFileNameHolder.setInputFileName(fs.getPath.toString) +case _ => InputFileNameHolder.unsetInputFileName() + } + // Find a function that will return the FileSystem bytes read by this thread. 
Do this before // creating RecordReader, because RecordReader's constructor might read some bytes val getBytesReadCallback: Option[() => Long] = split.serializableHadoopSplit.value match { @@ -201,6 +207,7 @@ class NewHadoopRDD[K, V]( private def close() { if (reader != null) { + InputFileNameHolder.unsetInputFileName() // Close the reader and release it. Note: it's very important that we don't close the // reader more than once, since that exposes us to MAPREDUCE-5918 when running against // Hadoop 1.x and older Hadoop 2.x releases. That bug can lead to non-deterministic http://git-wip-us.apache.org/repos/asf/spark/blob/4f7f1c43/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala index e89fa32..a66c83d 1
spark git commit: [SPARK-16074][MLLIB] expose VectorUDT/MatrixUDT in a public API
Repository: spark Updated Branches: refs/heads/master d9a3a2a0b -> 18a8a9b1f [SPARK-16074][MLLIB] expose VectorUDT/MatrixUDT in a public API ## What changes were proposed in this pull request? Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simply the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection. ## How was this patch tested? Unit tests in Scala and Java. Author: Xiangrui Meng Closes #13789 from mengxr/SPARK-16074. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/18a8a9b1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/18a8a9b1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/18a8a9b1 Branch: refs/heads/master Commit: 18a8a9b1f4114211cd108efda5672f2bd2c6e5cd Parents: d9a3a2a Author: Xiangrui Meng Authored: Mon Jun 20 21:51:02 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:51:02 2016 -0700 -- .../org/apache/spark/ml/linalg/dataTypes.scala | 35 .../spark/ml/linalg/JavaSQLDataTypesSuite.java | 31 + .../spark/ml/linalg/SQLDataTypesSuite.scala | 27 +++ 3 files changed, 93 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/18a8a9b1/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala b/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala new file mode 100644 index 000..52a6fd2 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.linalg + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.sql.types.DataType + +/** + * :: DeveloperApi :: + * SQL data types for vectors and matrices. + */ +@DeveloperApi +object sqlDataTypes { + + /** Data type for [[Vector]]. */ + val VectorType: DataType = new VectorUDT + + /** Data type for [[Matrix]]. */ + val MatrixType: DataType = new MatrixUDT +} http://git-wip-us.apache.org/repos/asf/spark/blob/18a8a9b1/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java -- diff --git a/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java b/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java new file mode 100644 index 000..b09e131 --- /dev/null +++ b/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java @@ -0,0 +1,31 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.linalg; + +import org.junit.Assert; +import org.junit.Test; + +import static org.apache.spark.ml.linalg.sqlDataTypes.*; + +public class JavaSQLDataTypesSuite { + @Test + public void testSQLDataTypes() { +Assert.assertEquals(new VectorUDT(), VectorType()); +Assert.assertEquals(new MatrixUDT(), MatrixType()); + } +} http://git-wip-us
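A hedged sketch of how a third-party transformer might use the new public types instead of reaching for the private UDTs via reflection; the helper and column names below are made up for illustration:

```scala
import org.apache.spark.ml.linalg.sqlDataTypes.VectorType
import org.apache.spark.sql.types.{StructField, StructType}

// Validate that the input column carries ML vectors, as a custom transformer's
// transformSchema typically must, then append a vector-typed output column.
def validateAndTransformSchema(
    schema: StructType,
    inputCol: String,
    outputCol: String): StructType = {
  val inputType = schema(inputCol).dataType
  require(inputType == VectorType,
    s"Column $inputCol must be of vector type but was $inputType.")
  StructType(schema.fields :+ StructField(outputCol, VectorType, nullable = false))
}
```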
spark git commit: [SPARK-16074][MLLIB] expose VectorUDT/MatrixUDT in a public API
Repository: spark Updated Branches: refs/heads/branch-2.0 b998c33c0 -> 9d513b8d2 [SPARK-16074][MLLIB] expose VectorUDT/MatrixUDT in a public API ## What changes were proposed in this pull request? Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simply the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection. ## How was this patch tested? Unit tests in Scala and Java. Author: Xiangrui Meng Closes #13789 from mengxr/SPARK-16074. (cherry picked from commit 18a8a9b1f4114211cd108efda5672f2bd2c6e5cd) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9d513b8d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9d513b8d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9d513b8d Branch: refs/heads/branch-2.0 Commit: 9d513b8d220657cd6c4ab6f182f446b4107d Parents: b998c33 Author: Xiangrui Meng Authored: Mon Jun 20 21:51:02 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:51:06 2016 -0700 -- .../org/apache/spark/ml/linalg/dataTypes.scala | 35 .../spark/ml/linalg/JavaSQLDataTypesSuite.java | 31 + .../spark/ml/linalg/SQLDataTypesSuite.scala | 27 +++ 3 files changed, 93 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9d513b8d/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala b/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala new file mode 100644 index 000..52a6fd2 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.linalg + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.sql.types.DataType + +/** + * :: DeveloperApi :: + * SQL data types for vectors and matrices. + */ +@DeveloperApi +object sqlDataTypes { + + /** Data type for [[Vector]]. */ + val VectorType: DataType = new VectorUDT + + /** Data type for [[Matrix]]. 
*/ + val MatrixType: DataType = new MatrixUDT +} http://git-wip-us.apache.org/repos/asf/spark/blob/9d513b8d/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java -- diff --git a/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java b/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java new file mode 100644 index 000..b09e131 --- /dev/null +++ b/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java @@ -0,0 +1,31 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.linalg; + +import org.junit.Assert; +import org.junit.Test; + +import static org.apache.spark.ml.linalg.sqlDataTypes.*; + +public class JavaSQLDataTypesSuite { + @Test + public void testSQLDataTypes() { +Assert.assertEquals(new Vecto
spark git commit: [SPARK-16056][SPARK-16057][SPARK-16058][SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source
Repository: spark Updated Branches: refs/heads/branch-2.0 603424c16 -> b998c33c0 [SPARK-16056][SPARK-16057][SPARK-16058][SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source What changes were proposed in this pull request? This PR is to fix the following bugs: **Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning** ```scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 4, upperBound = 0, numPartitions = 3, connectionProperties = new Properties) ``` **Before code changes:** The returned results are wrong and the generated partitions are wrong: ``` Part 0 id < 3 or id is null Part 1 id >= 3 AND id < 2 Part 2 id >= 2 ``` **After code changes:** Issue an `IllegalArgumentException` exception: ``` Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1 ``` **Issue 2: numPartitions is more than the number of key values between upper and lower bounds** ```scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 1, upperBound = 5, numPartitions = 10, connectionProperties = new Properties) ``` **Before code changes:** Returned correct results but the generated partitions are very inefficient, like: ``` Partition 0: id < 1 or id is null Partition 1: id >= 1 AND id < 1 Partition 2: id >= 1 AND id < 1 Partition 3: id >= 1 AND id < 1 Partition 4: id >= 1 AND id < 1 Partition 5: id >= 1 AND id < 1 Partition 6: id >= 1 AND id < 1 Partition 7: id >= 1 AND id < 1 Partition 8: id >= 1 AND id < 1 Partition 9: id >= 1 ``` **After code changes:** Adjust `numPartitions` and can return the correct answers: ``` Partition 0: id < 2 or id is null Partition 1: id >= 2 AND id < 3 Partition 2: id >= 3 AND id < 4 Partition 3: id >= 4 ``` **Issue 3: java.lang.ArithmeticException when numPartitions is zero** ```Scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 0, upperBound = 4, numPartitions = 0, connectionProperties = new Properties) ``` **Before code changes:** Got the following exception: ``` java.lang.ArithmeticException: / by zero ``` **After code changes:** Able to return a correct answer by disabling column partitioning when numPartitions is equal to or less than zero How was this patch tested? Added test cases to verify the results Author: gatorsmile Closes #13773 from gatorsmile/jdbcPartitioning. 
(cherry picked from commit d9a3a2a0bec504d17d3b94104d449ee3bd850120) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b998c33c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b998c33c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b998c33c Branch: refs/heads/branch-2.0 Commit: b998c33c0d38f8f724d8846bc8e919ec8b92012e Parents: 603424c Author: gatorsmile Authored: Mon Jun 20 21:49:33 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:49:39 2016 -0700 -- .../datasources/jdbc/JDBCRelation.scala | 48 ++- .../org/apache/spark/sql/jdbc/JDBCSuite.scala | 65 2 files changed, 98 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b998c33c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala index 233b789..11613dd 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala @@ -21,6 +21,7 @@ import java.util.Properties import scala.collection.mutable.ArrayBuffer +import org.apache.spark.internal.Logging import org.apache.spark.Partition import org.apache.spark.rdd.RDD import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession, SQLContext} @@ -36,7 +37,7 @@ private[sql] case class JDBCPartitioningInfo( upperBound: Long, numPartitions: Int) -private[sql] object JDBCRelation { +private[sql] object JDBCRelation extends Logging { /** * Given a partitioning schematic (a column of integral type, a number of * partitions, and upper and lower bounds on the column's value), generate @@ -52,29 +53,46 @@ private[sql] object JDBCRelation { * @return an array of partitions with where clause for each partition */ def columnPartition(partitioning: JDBCPartitioningInfo): Array[Partition] = { -if (parti
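For reference, a well-formed partitioned read after these fixes might look like the sketch below; the JDBC URL, table, and credentials are placeholders. With the fix, a numPartitions larger than the key range is clamped, and a lowerBound greater than the upperBound fails fast with an error instead of returning wrong results:

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partitioning-sketch").getOrCreate()

val props = new Properties()
props.setProperty("user", "username")      // placeholder credentials
props.setProperty("password", "password")

val df = spark.read.jdbc(
  url = "jdbc:postgresql://host:5432/db",  // placeholder URL
  table = "test.seq",
  columnName = "id",
  lowerBound = 0L,
  upperBound = 100L,
  numPartitions = 4,
  connectionProperties = props)
```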
spark git commit: [SPARK-16056][SPARK-16057][SPARK-16058][SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source
Repository: spark Updated Branches: refs/heads/master c775bf09e -> d9a3a2a0b [SPARK-16056][SPARK-16057][SPARK-16058][SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source What changes were proposed in this pull request? This PR is to fix the following bugs: **Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning** ```scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 4, upperBound = 0, numPartitions = 3, connectionProperties = new Properties) ``` **Before code changes:** The returned results are wrong and the generated partitions are wrong: ``` Part 0 id < 3 or id is null Part 1 id >= 3 AND id < 2 Part 2 id >= 2 ``` **After code changes:** Issue an `IllegalArgumentException` exception: ``` Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1 ``` **Issue 2: numPartitions is more than the number of key values between upper and lower bounds** ```scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 1, upperBound = 5, numPartitions = 10, connectionProperties = new Properties) ``` **Before code changes:** Returned correct results but the generated partitions are very inefficient, like: ``` Partition 0: id < 1 or id is null Partition 1: id >= 1 AND id < 1 Partition 2: id >= 1 AND id < 1 Partition 3: id >= 1 AND id < 1 Partition 4: id >= 1 AND id < 1 Partition 5: id >= 1 AND id < 1 Partition 6: id >= 1 AND id < 1 Partition 7: id >= 1 AND id < 1 Partition 8: id >= 1 AND id < 1 Partition 9: id >= 1 ``` **After code changes:** Adjust `numPartitions` and can return the correct answers: ``` Partition 0: id < 2 or id is null Partition 1: id >= 2 AND id < 3 Partition 2: id >= 3 AND id < 4 Partition 3: id >= 4 ``` **Issue 3: java.lang.ArithmeticException when numPartitions is zero** ```Scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 0, upperBound = 4, numPartitions = 0, connectionProperties = new Properties) ``` **Before code changes:** Got the following exception: ``` java.lang.ArithmeticException: / by zero ``` **After code changes:** Able to return a correct answer by disabling column partitioning when numPartitions is equal to or less than zero How was this patch tested? Added test cases to verify the results Author: gatorsmile Closes #13773 from gatorsmile/jdbcPartitioning. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d9a3a2a0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d9a3a2a0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d9a3a2a0 Branch: refs/heads/master Commit: d9a3a2a0bec504d17d3b94104d449ee3bd850120 Parents: c775bf0 Author: gatorsmile Authored: Mon Jun 20 21:49:33 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:49:33 2016 -0700 -- .../datasources/jdbc/JDBCRelation.scala | 48 ++- .../org/apache/spark/sql/jdbc/JDBCSuite.scala | 65 2 files changed, 98 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d9a3a2a0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala index 233b789..11613dd 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala @@ -21,6 +21,7 @@ import java.util.Properties import scala.collection.mutable.ArrayBuffer +import org.apache.spark.internal.Logging import org.apache.spark.Partition import org.apache.spark.rdd.RDD import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession, SQLContext} @@ -36,7 +37,7 @@ private[sql] case class JDBCPartitioningInfo( upperBound: Long, numPartitions: Int) -private[sql] object JDBCRelation { +private[sql] object JDBCRelation extends Logging { /** * Given a partitioning schematic (a column of integral type, a number of * partitions, and upper and lower bounds on the column's value), generate @@ -52,29 +53,46 @@ private[sql] object JDBCRelation { * @return an array of partitions with where clause for each partition */ def columnPartition(partitioning: JDBCPartitioningInfo): Array[Partition] = { -if (partitioning == null) return Array[Partition](JDBCPartition(null, 0)) +if (partitioning == null || partitio
spark git commit: [SPARK-13792][SQL] Limit logging of bad records in CSV data source
Repository: spark Updated Branches: refs/heads/branch-2.0 10c476fc8 -> 603424c16 [SPARK-13792][SQL] Limit logging of bad records in CSV data source ## What changes were proposed in this pull request? This pull request adds a new option (maxMalformedLogPerPartition) in CSV reader to limit the maximum of logging message Spark generates per partition for malformed records. The error log looks something like ``` 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged. ``` Closes #12173 ## How was this patch tested? Manually tested. Author: Reynold Xin Closes #13795 from rxin/SPARK-13792. (cherry picked from commit c775bf09e0c3540f76de3f15d3fd35112a4912c1) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/603424c1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/603424c1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/603424c1 Branch: refs/heads/branch-2.0 Commit: 603424c161e9be670ee8461053225364cc700515 Parents: 10c476f Author: Reynold Xin Authored: Mon Jun 20 21:46:12 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:46:20 2016 -0700 -- python/pyspark/sql/readwriter.py| 4 ++ .../org/apache/spark/sql/DataFrameReader.scala | 2 + .../datasources/csv/CSVFileFormat.scala | 9 - .../execution/datasources/csv/CSVOptions.scala | 2 + .../execution/datasources/csv/CSVRelation.scala | 42 +--- 5 files changed, 44 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/603424c1/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 72fd184..89506ca 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -392,6 +392,10 @@ class DataFrameReader(ReaderUtils): :param maxCharsPerColumn: defines the maximum number of characters allowed for any given value being read. If None is set, it uses the default value, ``100``. +:param maxMalformedLogPerPartition: sets the maximum number of malformed rows Spark will +log for each partition. Malformed records beyond this +number will be ignored. If None is set, it +uses the default value, ``10``. :param mode: allows a mode for dealing with corrupt records during parsing. If None is set, it uses the default value, ``PERMISSIVE``. 
http://git-wip-us.apache.org/repos/asf/spark/blob/603424c1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index 841503b..35ba9c5 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -382,6 +382,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * a record can have. * `maxCharsPerColumn` (default `100`): defines the maximum number of characters allowed * for any given value being read. + * `maxMalformedLogPerPartition` (default `10`): sets the maximum number of malformed rows + * Spark will log for each partition. Malformed records beyond this number will be ignored. * `mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records *during parsing. * http://git-wip-us.apache.org/repos/asf/spark/blob/603424c1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala -- diff --git
spark git commit: [SPARK-13792][SQL] Limit logging of bad records in CSV data source
Repository: spark Updated Branches: refs/heads/master 217db56ba -> c775bf09e [SPARK-13792][SQL] Limit logging of bad records in CSV data source ## What changes were proposed in this pull request? This pull request adds a new option (maxMalformedLogPerPartition) in CSV reader to limit the maximum of logging message Spark generates per partition for malformed records. The error log looks something like ``` 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged. ``` Closes #12173 ## How was this patch tested? Manually tested. Author: Reynold Xin Closes #13795 from rxin/SPARK-13792. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c775bf09 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c775bf09 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c775bf09 Branch: refs/heads/master Commit: c775bf09e0c3540f76de3f15d3fd35112a4912c1 Parents: 217db56 Author: Reynold Xin Authored: Mon Jun 20 21:46:12 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:46:12 2016 -0700 -- python/pyspark/sql/readwriter.py| 4 ++ .../org/apache/spark/sql/DataFrameReader.scala | 2 + .../datasources/csv/CSVFileFormat.scala | 9 - .../execution/datasources/csv/CSVOptions.scala | 2 + .../execution/datasources/csv/CSVRelation.scala | 42 +--- 5 files changed, 44 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c775bf09/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 72fd184..89506ca 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -392,6 +392,10 @@ class DataFrameReader(ReaderUtils): :param maxCharsPerColumn: defines the maximum number of characters allowed for any given value being read. If None is set, it uses the default value, ``100``. +:param maxMalformedLogPerPartition: sets the maximum number of malformed rows Spark will +log for each partition. Malformed records beyond this +number will be ignored. If None is set, it +uses the default value, ``10``. :param mode: allows a mode for dealing with corrupt records during parsing. If None is set, it uses the default value, ``PERMISSIVE``. 
http://git-wip-us.apache.org/repos/asf/spark/blob/c775bf09/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index 841503b..35ba9c5 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -382,6 +382,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * a record can have. * `maxCharsPerColumn` (default `100`): defines the maximum number of characters allowed * for any given value being read. + * `maxMalformedLogPerPartition` (default `10`): sets the maximum number of malformed rows + * Spark will log for each partition. Malformed records beyond this number will be ignored. * `mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records *during parsing. * http://git-wip-us.apache.org/repos/asf/spark/blob/c775bf09/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala b/sql/core/
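As a usage note for the option added in SPARK-13792, a minimal PySpark sketch follows. The input path and data are hypothetical; the same `maxMalformedLogPerPartition` key is the one documented for the Scala `DataFrameReader` above.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-malformed-demo").getOrCreate()

# Hypothetical input file that contains some malformed rows.
df = (spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")             # drop bad rows instead of failing
      .option("maxMalformedLogPerPartition", "5")  # warn about at most 5 bad rows per partition
      .csv("/tmp/people.csv"))

df.show()
```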
spark git commit: [SPARK-15294][R] Add `pivot` to SparkR
Repository: spark Updated Branches: refs/heads/branch-2.0 087bd2799 -> 10c476fc8 [SPARK-15294][R] Add `pivot` to SparkR ## What changes were proposed in this pull request? This PR adds `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295 , mhnatiuk should be credited for the work he did. ## How was this patch tested? Pass the Jenkins tests (including new testcase.) Author: Dongjoon Hyun Closes #13786 from dongjoon-hyun/SPARK-15294. (cherry picked from commit 217db56ba11fcdf9e3a81946667d1d99ad7344ee) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/10c476fc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/10c476fc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/10c476fc Branch: refs/heads/branch-2.0 Commit: 10c476fc8f4780e487d8ada626f6924866f5711f Parents: 087bd27 Author: Dongjoon Hyun Authored: Mon Jun 20 21:09:39 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 21:09:51 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/generics.R| 4 +++ R/pkg/R/group.R | 43 ++ R/pkg/inst/tests/testthat/test_sparkSQL.R | 25 +++ 4 files changed, 73 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/10c476fc/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 45663f4..ea42888 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -294,6 +294,7 @@ exportMethods("%in%", exportClasses("GroupedData") exportMethods("agg") +exportMethods("pivot") export("as.DataFrame", "cacheTable", http://git-wip-us.apache.org/repos/asf/spark/blob/10c476fc/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 3fb6370..c307de7 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -160,6 +160,10 @@ setGeneric("persist", function(x, newLevel) { standardGeneric("persist") }) # @export setGeneric("pipeRDD", function(x, command, env = list()) { standardGeneric("pipeRDD")}) +# @rdname pivot +# @export +setGeneric("pivot", function(x, colname, values = list()) { standardGeneric("pivot") }) + # @rdname reduce # @export setGeneric("reduce", function(x, func) { standardGeneric("reduce") }) http://git-wip-us.apache.org/repos/asf/spark/blob/10c476fc/R/pkg/R/group.R -- diff --git a/R/pkg/R/group.R b/R/pkg/R/group.R index 51e1516..0687f14 100644 --- a/R/pkg/R/group.R +++ b/R/pkg/R/group.R @@ -134,6 +134,49 @@ methods <- c("avg", "max", "mean", "min", "sum") # These are not exposed on GroupedData: "kurtosis", "skewness", "stddev", "stddev_samp", "stddev_pop", # "variance", "var_samp", "var_pop" +#' Pivot a column of the GroupedData and perform the specified aggregation. +#' +#' Pivot a column of the GroupedData and perform the specified aggregation. +#' There are two versions of pivot function: one that requires the caller to specify the list +#' of distinct values to pivot on, and one that does not. The latter is more concise but less +#' efficient, because Spark needs to first compute the list of distinct values internally. +#' +#' @param x a GroupedData object +#' @param colname A column name +#' @param values A value or a list/vector of distinct values for the output columns. 
+#' @return GroupedData object +#' @rdname pivot +#' @name pivot +#' @export +#' @examples +#' \dontrun{ +#' df <- createDataFrame(data.frame( +#' earnings = c(1, 1, 11000, 15000, 12000, 2, 21000, 22000), +#' course = c("R", "Python", "R", "Python", "R", "Python", "R", "Python"), +#' period = c("1H", "1H", "2H", "2H", "1H", "1H", "2H", "2H"), +#' year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016) +#' )) +#' group_sum <- sum(pivot(groupBy(df, "year"), "course"), "earnings") +#' group_min <- min(pivot(groupBy(df, "year"), "course", "R"), "earnings") +#' group_max <- max(pivot(groupBy(df, "year"), "course", c("Python", "R")), "earnings") +#' group_mean <- mean(pivot(groupBy(df, "year"), "course", list("Python", "R")), "earnings") +#' } +#' @note pivot since 2.0.0 +setMethod("pivot", + signature(x = "GroupedData", colname = "character"), + function(x, colname, values = list()){ +stopifnot(length(colname) == 1) +if (length(values) == 0) { + result <- callJMethod(x@sgd, "pivot", colname) +} else { + if (length(values) > length(unique(values))) { +stop("Values are not unique") + } +
spark git commit: [SPARK-15294][R] Add `pivot` to SparkR
Repository: spark Updated Branches: refs/heads/master a46553cba -> 217db56ba [SPARK-15294][R] Add `pivot` to SparkR ## What changes were proposed in this pull request? This PR adds `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295 , mhnatiuk should be credited for the work he did. ## How was this patch tested? Pass the Jenkins tests (including new testcase.) Author: Dongjoon Hyun Closes #13786 from dongjoon-hyun/SPARK-15294. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/217db56b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/217db56b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/217db56b Branch: refs/heads/master Commit: 217db56ba11fcdf9e3a81946667d1d99ad7344ee Parents: a46553c Author: Dongjoon Hyun Authored: Mon Jun 20 21:09:39 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 21:09:39 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/generics.R| 4 +++ R/pkg/R/group.R | 43 ++ R/pkg/inst/tests/testthat/test_sparkSQL.R | 25 +++ 4 files changed, 73 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/217db56b/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 45663f4..ea42888 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -294,6 +294,7 @@ exportMethods("%in%", exportClasses("GroupedData") exportMethods("agg") +exportMethods("pivot") export("as.DataFrame", "cacheTable", http://git-wip-us.apache.org/repos/asf/spark/blob/217db56b/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 3fb6370..c307de7 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -160,6 +160,10 @@ setGeneric("persist", function(x, newLevel) { standardGeneric("persist") }) # @export setGeneric("pipeRDD", function(x, command, env = list()) { standardGeneric("pipeRDD")}) +# @rdname pivot +# @export +setGeneric("pivot", function(x, colname, values = list()) { standardGeneric("pivot") }) + # @rdname reduce # @export setGeneric("reduce", function(x, func) { standardGeneric("reduce") }) http://git-wip-us.apache.org/repos/asf/spark/blob/217db56b/R/pkg/R/group.R -- diff --git a/R/pkg/R/group.R b/R/pkg/R/group.R index 51e1516..0687f14 100644 --- a/R/pkg/R/group.R +++ b/R/pkg/R/group.R @@ -134,6 +134,49 @@ methods <- c("avg", "max", "mean", "min", "sum") # These are not exposed on GroupedData: "kurtosis", "skewness", "stddev", "stddev_samp", "stddev_pop", # "variance", "var_samp", "var_pop" +#' Pivot a column of the GroupedData and perform the specified aggregation. +#' +#' Pivot a column of the GroupedData and perform the specified aggregation. +#' There are two versions of pivot function: one that requires the caller to specify the list +#' of distinct values to pivot on, and one that does not. The latter is more concise but less +#' efficient, because Spark needs to first compute the list of distinct values internally. +#' +#' @param x a GroupedData object +#' @param colname A column name +#' @param values A value or a list/vector of distinct values for the output columns. 
+#' @return GroupedData object +#' @rdname pivot +#' @name pivot +#' @export +#' @examples +#' \dontrun{ +#' df <- createDataFrame(data.frame( +#' earnings = c(1, 1, 11000, 15000, 12000, 2, 21000, 22000), +#' course = c("R", "Python", "R", "Python", "R", "Python", "R", "Python"), +#' period = c("1H", "1H", "2H", "2H", "1H", "1H", "2H", "2H"), +#' year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016) +#' )) +#' group_sum <- sum(pivot(groupBy(df, "year"), "course"), "earnings") +#' group_min <- min(pivot(groupBy(df, "year"), "course", "R"), "earnings") +#' group_max <- max(pivot(groupBy(df, "year"), "course", c("Python", "R")), "earnings") +#' group_mean <- mean(pivot(groupBy(df, "year"), "course", list("Python", "R")), "earnings") +#' } +#' @note pivot since 2.0.0 +setMethod("pivot", + signature(x = "GroupedData", colname = "character"), + function(x, colname, values = list()){ +stopifnot(length(colname) == 1) +if (length(values) == 0) { + result <- callJMethod(x@sgd, "pivot", colname) +} else { + if (length(values) > length(unique(values))) { +stop("Values are not unique") + } + result <- callJMethod(x@sgd, "pivot", colname, as.list(values)) +} +groupedData(resu
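The new SparkR method delegates to the same JVM `pivot` that Scala and Python already expose. For comparison, a rough PySpark counterpart of the roxygen example above (illustrative only):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pivot-demo").getOrCreate()

df = spark.createDataFrame(
    [(10000, "R", "1H", 2015), (12000, "Python", "1H", 2015),
     (11000, "R", "2H", 2015), (15000, "Python", "2H", 2015),
     (21000, "R", "1H", 2016), (22000, "Python", "1H", 2016)],
    ["earnings", "course", "period", "year"])

# Passing the distinct values up front avoids an extra pass to compute them.
df.groupBy("year").pivot("course", ["Python", "R"]).sum("earnings").show()
```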
spark git commit: [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6)
Repository: spark Updated Branches: refs/heads/master e2b7eba87 -> a46553cba [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6) Fix the bug for Python UDF that does not have any arguments. Added regression tests. Author: Davies Liu Closes #13793 from davies/fix_no_arguments. (cherry picked from commit abe36c53d126bb580e408a45245fd8e81806869c) Signed-off-by: Davies Liu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a46553cb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a46553cb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a46553cb Branch: refs/heads/master Commit: a46553cbacf0e4012df89fe55385dec5beaa680a Parents: e2b7eba Author: Davies Liu Authored: Mon Jun 20 20:50:30 2016 -0700 Committer: Davies Liu Committed: Mon Jun 20 20:53:45 2016 -0700 -- python/pyspark/sql/tests.py | 5 + python/pyspark/sql/types.py | 9 +++-- 2 files changed, 8 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a46553cb/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index c631ad8..ecd1a05 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -318,6 +318,11 @@ class SQLTests(ReusedPySparkTestCase): [row] = self.spark.sql("SELECT double(add(1, 2)), add(double(2), 1)").collect() self.assertEqual(tuple(row), (6, 5)) +def test_udf_without_arguments(self): +self.sqlCtx.registerFunction("foo", lambda: "bar") +[row] = self.sqlCtx.sql("SELECT foo()").collect() +self.assertEqual(row[0], "bar") + def test_udf_with_array_type(self): d = [Row(l=list(range(3)), d={"key": list(range(5))})] rdd = self.sc.parallelize(d) http://git-wip-us.apache.org/repos/asf/spark/blob/a46553cb/python/pyspark/sql/types.py -- diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py index bb2b954..f0b56be 100644 --- a/python/pyspark/sql/types.py +++ b/python/pyspark/sql/types.py @@ -1401,11 +1401,7 @@ class Row(tuple): if args and kwargs: raise ValueError("Can not use both args " "and kwargs to create Row") -if args: -# create row class or objects -return tuple.__new__(self, args) - -elif kwargs: +if kwargs: # create row objects names = sorted(kwargs.keys()) row = tuple.__new__(self, [kwargs[n] for n in names]) @@ -1413,7 +1409,8 @@ class Row(tuple): return row else: -raise ValueError("No args or kwargs") +# create row class or objects +return tuple.__new__(self, args) def asDict(self, recursive=False): """ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6)
Repository: spark Updated Branches: refs/heads/branch-2.0 f57317690 -> 087bd2799 [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6) Fix the bug for Python UDF that does not have any arguments. Added regression tests. Author: Davies Liu Closes #13793 from davies/fix_no_arguments. (cherry picked from commit abe36c53d126bb580e408a45245fd8e81806869c) Signed-off-by: Davies Liu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/087bd279 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/087bd279 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/087bd279 Branch: refs/heads/branch-2.0 Commit: 087bd2799366f4914d248e9b1f0fb921adbbdb43 Parents: f573176 Author: Davies Liu Authored: Mon Jun 20 20:50:30 2016 -0700 Committer: Davies Liu Committed: Mon Jun 20 20:52:55 2016 -0700 -- python/pyspark/sql/tests.py | 5 + python/pyspark/sql/types.py | 9 +++-- 2 files changed, 8 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/087bd279/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index c631ad8..ecd1a05 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -318,6 +318,11 @@ class SQLTests(ReusedPySparkTestCase): [row] = self.spark.sql("SELECT double(add(1, 2)), add(double(2), 1)").collect() self.assertEqual(tuple(row), (6, 5)) +def test_udf_without_arguments(self): +self.sqlCtx.registerFunction("foo", lambda: "bar") +[row] = self.sqlCtx.sql("SELECT foo()").collect() +self.assertEqual(row[0], "bar") + def test_udf_with_array_type(self): d = [Row(l=list(range(3)), d={"key": list(range(5))})] rdd = self.sc.parallelize(d) http://git-wip-us.apache.org/repos/asf/spark/blob/087bd279/python/pyspark/sql/types.py -- diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py index bb2b954..f0b56be 100644 --- a/python/pyspark/sql/types.py +++ b/python/pyspark/sql/types.py @@ -1401,11 +1401,7 @@ class Row(tuple): if args and kwargs: raise ValueError("Can not use both args " "and kwargs to create Row") -if args: -# create row class or objects -return tuple.__new__(self, args) - -elif kwargs: +if kwargs: # create row objects names = sorted(kwargs.keys()) row = tuple.__new__(self, [kwargs[n] for n in names]) @@ -1413,7 +1409,8 @@ class Row(tuple): return row else: -raise ValueError("No args or kwargs") +# create row class or objects +return tuple.__new__(self, args) def asDict(self, recursive=False): """ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6)
Repository: spark Updated Branches: refs/heads/branch-1.5 1891e04a6 -> 6001138fd [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6) ## What changes were proposed in this pull request? Fix the bug for Python UDF that does not have any arguments. ## How was this patch tested? Added regression tests. Author: Davies Liu Closes #13793 from davies/fix_no_arguments. (cherry picked from commit abe36c53d126bb580e408a45245fd8e81806869c) Signed-off-by: Davies Liu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6001138f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6001138f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6001138f Branch: refs/heads/branch-1.5 Commit: 6001138fd68f2318028519d09563f12874b54e7d Parents: 1891e04 Author: Davies Liu Authored: Mon Jun 20 20:50:30 2016 -0700 Committer: Davies Liu Committed: Mon Jun 20 20:50:57 2016 -0700 -- python/pyspark/sql/tests.py | 5 + python/pyspark/sql/types.py | 9 +++-- 2 files changed, 8 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6001138f/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 27c9d45..86e2dfb 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -286,6 +286,11 @@ class SQLTests(ReusedPySparkTestCase): [res] = self.sqlCtx.sql("SELECT strlen(a) FROM test WHERE strlen(a) > 1").collect() self.assertEqual(4, res[0]) +def test_udf_without_arguments(self): +self.sqlCtx.registerFunction("foo", lambda: "bar") +[row] = self.sqlCtx.sql("SELECT foo()").collect() +self.assertEqual(row[0], "bar") + def test_udf_with_array_type(self): d = [Row(l=list(range(3)), d={"key": list(range(5))})] rdd = self.sc.parallelize(d) http://git-wip-us.apache.org/repos/asf/spark/blob/6001138f/python/pyspark/sql/types.py -- diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py index b0ac207..db4cc42 100644 --- a/python/pyspark/sql/types.py +++ b/python/pyspark/sql/types.py @@ -1193,11 +1193,7 @@ class Row(tuple): if args and kwargs: raise ValueError("Can not use both args " "and kwargs to create Row") -if args: -# create row class or objects -return tuple.__new__(self, args) - -elif kwargs: +if kwargs: # create row objects names = sorted(kwargs.keys()) row = tuple.__new__(self, [kwargs[n] for n in names]) @@ -1205,7 +1201,8 @@ class Row(tuple): return row else: -raise ValueError("No args or kwargs") +# create row class or objects +return tuple.__new__(self, args) def asDict(self, recursive=False): """ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6)
Repository: spark Updated Branches: refs/heads/branch-1.6 db86e7fd2 -> abe36c53d [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6) ## What changes were proposed in this pull request? Fix the bug for Python UDF that does not have any arguments. ## How was this patch tested? Added regression tests. Author: Davies Liu Closes #13793 from davies/fix_no_arguments. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/abe36c53 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/abe36c53 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/abe36c53 Branch: refs/heads/branch-1.6 Commit: abe36c53d126bb580e408a45245fd8e81806869c Parents: db86e7f Author: Davies Liu Authored: Mon Jun 20 20:50:30 2016 -0700 Committer: Davies Liu Committed: Mon Jun 20 20:50:30 2016 -0700 -- python/pyspark/sql/tests.py | 5 + python/pyspark/sql/types.py | 9 +++-- 2 files changed, 8 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/abe36c53/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 0dc4274..43eb6ec 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -305,6 +305,11 @@ class SQLTests(ReusedPySparkTestCase): [res] = self.sqlCtx.sql("SELECT strlen(a) FROM test WHERE strlen(a) > 1").collect() self.assertEqual(4, res[0]) +def test_udf_without_arguments(self): +self.sqlCtx.registerFunction("foo", lambda: "bar") +[row] = self.sqlCtx.sql("SELECT foo()").collect() +self.assertEqual(row[0], "bar") + def test_udf_with_array_type(self): d = [Row(l=list(range(3)), d={"key": list(range(5))})] rdd = self.sc.parallelize(d) http://git-wip-us.apache.org/repos/asf/spark/blob/abe36c53/python/pyspark/sql/types.py -- diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py index 5bc0773..211b01f 100644 --- a/python/pyspark/sql/types.py +++ b/python/pyspark/sql/types.py @@ -1195,11 +1195,7 @@ class Row(tuple): if args and kwargs: raise ValueError("Can not use both args " "and kwargs to create Row") -if args: -# create row class or objects -return tuple.__new__(self, args) - -elif kwargs: +if kwargs: # create row objects names = sorted(kwargs.keys()) row = tuple.__new__(self, [kwargs[n] for n in names]) @@ -1207,7 +1203,8 @@ class Row(tuple): return row else: -raise ValueError("No args or kwargs") +# create row class or objects +return tuple.__new__(self, args) def asDict(self, recursive=False): """ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
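The regression test included in the patch is essentially the whole usage story; spelled out as a standalone sketch (the function name `foo` is simply the one used in the test):

```
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="zero-arg-udf-demo")
sqlContext = SQLContext(sc)

# A Python UDF with no arguments; before this fix, building the result Row
# failed with "No args or kwargs".
sqlContext.registerFunction("foo", lambda: "bar")
[row] = sqlContext.sql("SELECT foo()").collect()
print(row[0])  # 'bar'
```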
spark git commit: remove duplicated docs in dapply
Repository: spark Updated Branches: refs/heads/branch-2.0 c7006538a -> f57317690 remove duplicated docs in dapply ## What changes were proposed in this pull request? Removed unnecessary duplicated documentation in dapply and dapplyCollect. In this pull request I created separate R docs for dapply and dapplyCollect - kept dapply's documentation separate from dapplyCollect's and referred from one to another via a link. ## How was this patch tested? Existing test cases. Author: Narine Kokhlikyan Closes #13790 from NarineK/dapply-docs-fix. (cherry picked from commit e2b7eba87cdf67fa737c32f5f6ca075445ff28cb) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5731769 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5731769 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5731769 Branch: refs/heads/branch-2.0 Commit: f573176902ebff0fd6a2f572c94a2cca3e057b72 Parents: c700653 Author: Narine Kokhlikyan Authored: Mon Jun 20 19:36:51 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 19:36:58 2016 -0700 -- R/pkg/R/DataFrame.R | 4 +++- R/pkg/R/generics.R | 2 +- 2 files changed, 4 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f5731769/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index ecdcd6e..b3f2dd8 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -1250,6 +1250,7 @@ dapplyInternal <- function(x, func, schema) { #' @family SparkDataFrame functions #' @rdname dapply #' @name dapply +#' @seealso \link{dapplyCollect} #' @export #' @examples #' \dontrun{ @@ -1294,8 +1295,9 @@ setMethod("dapply", #' to each partition will be passed. #' The output of func should be a data.frame. #' @family SparkDataFrame functions -#' @rdname dapply +#' @rdname dapplyCollect #' @name dapplyCollect +#' @seealso \link{dapply} #' @export #' @examples #' \dontrun{ http://git-wip-us.apache.org/repos/asf/spark/blob/f5731769/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index f6b9276..3fb6370 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -457,7 +457,7 @@ setGeneric("createOrReplaceTempView", #' @export setGeneric("dapply", function(x, func, schema) { standardGeneric("dapply") }) -#' @rdname dapply +#' @rdname dapplyCollect #' @export setGeneric("dapplyCollect", function(x, func) { standardGeneric("dapplyCollect") }) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: remove duplicated docs in dapply
Repository: spark Updated Branches: refs/heads/master a42bf5553 -> e2b7eba87 remove duplicated docs in dapply ## What changes were proposed in this pull request? Removed unnecessary duplicated documentation in dapply and dapplyCollect. In this pull request I created separate R docs for dapply and dapplyCollect - kept dapply's documentation separate from dapplyCollect's and referred from one to another via a link. ## How was this patch tested? Existing test cases. Author: Narine Kokhlikyan Closes #13790 from NarineK/dapply-docs-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e2b7eba8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e2b7eba8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e2b7eba8 Branch: refs/heads/master Commit: e2b7eba87cdf67fa737c32f5f6ca075445ff28cb Parents: a42bf55 Author: Narine Kokhlikyan Authored: Mon Jun 20 19:36:51 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 19:36:51 2016 -0700 -- R/pkg/R/DataFrame.R | 4 +++- R/pkg/R/generics.R | 2 +- 2 files changed, 4 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e2b7eba8/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index ecdcd6e..b3f2dd8 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -1250,6 +1250,7 @@ dapplyInternal <- function(x, func, schema) { #' @family SparkDataFrame functions #' @rdname dapply #' @name dapply +#' @seealso \link{dapplyCollect} #' @export #' @examples #' \dontrun{ @@ -1294,8 +1295,9 @@ setMethod("dapply", #' to each partition will be passed. #' The output of func should be a data.frame. #' @family SparkDataFrame functions -#' @rdname dapply +#' @rdname dapplyCollect #' @name dapplyCollect +#' @seealso \link{dapply} #' @export #' @examples #' \dontrun{ http://git-wip-us.apache.org/repos/asf/spark/blob/e2b7eba8/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index f6b9276..3fb6370 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -457,7 +457,7 @@ setGeneric("createOrReplaceTempView", #' @export setGeneric("dapply", function(x, func, schema) { standardGeneric("dapply") }) -#' @rdname dapply +#' @rdname dapplyCollect #' @export setGeneric("dapplyCollect", function(x, func) { standardGeneric("dapplyCollect") }) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16079][PYSPARK][ML] Added missing import for DecisionTreeRegressionModel used in GBTClassificationModel
Repository: spark Updated Branches: refs/heads/branch-2.0 b40663541 -> c7006538a [SPARK-16079][PYSPARK][ML] Added missing import for DecisionTreeRegressionModel used in GBTClassificationModel ## What changes were proposed in this pull request? Fixed missing import for DecisionTreeRegressionModel used in GBTClassificationModel trees method. ## How was this patch tested? Local tests Author: Bryan Cutler Closes #13787 from BryanCutler/pyspark-GBTClassificationModel-import-SPARK-16079. (cherry picked from commit a42bf555326b75c8251be77db68105c29e8c95c4) Signed-off-by: Xiangrui Meng Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c7006538 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c7006538 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c7006538 Branch: refs/heads/branch-2.0 Commit: c7006538a88bee85e0292bc9564ae8bfdf734ed6 Parents: b406635 Author: Bryan Cutler Authored: Mon Jun 20 16:28:11 2016 -0700 Committer: Xiangrui Meng Committed: Mon Jun 20 16:28:19 2016 -0700 -- python/pyspark/ml/classification.py | 6 -- python/pyspark/ml/regression.py | 2 ++ 2 files changed, 6 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c7006538/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 121b926..a3cd917 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -21,8 +21,8 @@ import warnings from pyspark import since, keyword_only from pyspark.ml import Estimator, Model from pyspark.ml.param.shared import * -from pyspark.ml.regression import ( -RandomForestParams, TreeEnsembleParams, DecisionTreeModel, TreeEnsembleModels) +from pyspark.ml.regression import DecisionTreeModel, DecisionTreeRegressionModel, \ +RandomForestParams, TreeEnsembleModels, TreeEnsembleParams from pyspark.ml.util import * from pyspark.ml.wrapper import JavaEstimator, JavaModel, JavaParams from pyspark.ml.wrapper import JavaWrapper @@ -798,6 +798,8 @@ class GBTClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol True >>> model.treeWeights == model2.treeWeights True +>>> model.trees +[DecisionTreeRegressionModel (uid=...) of depth..., DecisionTreeRegressionModel...] .. versionadded:: 1.4.0 """ http://git-wip-us.apache.org/repos/asf/spark/blob/c7006538/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index db31993..8d2378d 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -994,6 +994,8 @@ class GBTRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, True >>> model.treeWeights == model2.treeWeights True +>>> model.trees +[DecisionTreeRegressionModel (uid=...) of depth..., DecisionTreeRegressionModel...] .. versionadded:: 1.4.0 """ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16079][PYSPARK][ML] Added missing import for DecisionTreeRegressionModel used in GBTClassificationModel
Repository: spark Updated Branches: refs/heads/master 6daa8cf1a -> a42bf5553 [SPARK-16079][PYSPARK][ML] Added missing import for DecisionTreeRegressionModel used in GBTClassificationModel ## What changes were proposed in this pull request? Fixed missing import for DecisionTreeRegressionModel used in GBTClassificationModel trees method. ## How was this patch tested? Local tests Author: Bryan Cutler Closes #13787 from BryanCutler/pyspark-GBTClassificationModel-import-SPARK-16079. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a42bf555 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a42bf555 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a42bf555 Branch: refs/heads/master Commit: a42bf555326b75c8251be77db68105c29e8c95c4 Parents: 6daa8cf Author: Bryan Cutler Authored: Mon Jun 20 16:28:11 2016 -0700 Committer: Xiangrui Meng Committed: Mon Jun 20 16:28:11 2016 -0700 -- python/pyspark/ml/classification.py | 6 -- python/pyspark/ml/regression.py | 2 ++ 2 files changed, 6 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a42bf555/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 121b926..a3cd917 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -21,8 +21,8 @@ import warnings from pyspark import since, keyword_only from pyspark.ml import Estimator, Model from pyspark.ml.param.shared import * -from pyspark.ml.regression import ( -RandomForestParams, TreeEnsembleParams, DecisionTreeModel, TreeEnsembleModels) +from pyspark.ml.regression import DecisionTreeModel, DecisionTreeRegressionModel, \ +RandomForestParams, TreeEnsembleModels, TreeEnsembleParams from pyspark.ml.util import * from pyspark.ml.wrapper import JavaEstimator, JavaModel, JavaParams from pyspark.ml.wrapper import JavaWrapper @@ -798,6 +798,8 @@ class GBTClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol True >>> model.treeWeights == model2.treeWeights True +>>> model.trees +[DecisionTreeRegressionModel (uid=...) of depth..., DecisionTreeRegressionModel...] .. versionadded:: 1.4.0 """ http://git-wip-us.apache.org/repos/asf/spark/blob/a42bf555/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index db31993..8d2378d 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -994,6 +994,8 @@ class GBTRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, True >>> model.treeWeights == model2.treeWeights True +>>> model.trees +[DecisionTreeRegressionModel (uid=...) of depth..., DecisionTreeRegressionModel...] .. versionadded:: 1.4.0 """ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
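The doctest added above exercises `model.trees`; a self-contained PySpark sketch of the same access pattern, using a toy two-row training set (hypothetical data, chosen only so the fit runs):

```
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gbt-trees-demo").getOrCreate()

train = spark.createDataFrame(
    [(1.0, Vectors.dense(1.0)), (0.0, Vectors.sparse(1, [], []))],
    ["label", "features"])

model = GBTClassifier(maxIter=5, maxDepth=2).fit(train)

# Each element is a DecisionTreeRegressionModel -- the class whose import was missing.
for tree in model.trees:
    print(tree)
print(model.treeWeights)
```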
spark git commit: [SPARK-16061][SQL][MINOR] The property "spark.streaming.stateStore.maintenanceInterval" should be renamed to "spark.sql.streaming.stateStore.maintenanceInterval"
Repository: spark Updated Branches: refs/heads/master b99129cc4 -> 6daa8cf1a [SPARK-16061][SQL][MINOR] The property "spark.streaming.stateStore.maintenanceInterval" should be renamed to "spark.sql.streaming.stateStore.maintenanceInterval" ## What changes were proposed in this pull request? The property spark.streaming.stateStore.maintenanceInterval should be renamed and harmonized with other properties related to Structured Streaming like spark.sql.streaming.stateStore.minDeltasForSnapshot. ## How was this patch tested? Existing unit tests. Author: Kousuke Saruta Closes #13777 from sarutak/SPARK-16061. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6daa8cf1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6daa8cf1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6daa8cf1 Branch: refs/heads/master Commit: 6daa8cf1a642a669cd3a0305036c4390e4336a73 Parents: b99129c Author: Kousuke Saruta Authored: Mon Jun 20 15:12:40 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 15:12:40 2016 -0700 -- .../apache/spark/sql/execution/streaming/state/StateStore.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6daa8cf1/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala index 9948292..0667653 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala @@ -115,7 +115,7 @@ case class KeyRemoved(key: UnsafeRow) extends StoreUpdate */ private[sql] object StateStore extends Logging { - val MAINTENANCE_INTERVAL_CONFIG = "spark.streaming.stateStore.maintenanceInterval" + val MAINTENANCE_INTERVAL_CONFIG = "spark.sql.streaming.stateStore.maintenanceInterval" val MAINTENANCE_INTERVAL_DEFAULT_SECS = 60 private val loadedProviders = new mutable.HashMap[StateStoreId, StateStoreProvider]() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16061][SQL][MINOR] The property "spark.streaming.stateStore.maintenanceInterval" should be renamed to "spark.sql.streaming.stateStore.maintenanceInterval"
Repository: spark Updated Branches: refs/heads/branch-2.0 54001cb12 -> b40663541 [SPARK-16061][SQL][MINOR] The property "spark.streaming.stateStore.maintenanceInterval" should be renamed to "spark.sql.streaming.stateStore.maintenanceInterval" ## What changes were proposed in this pull request? The property spark.streaming.stateStore.maintenanceInterval should be renamed and harmonized with other properties related to Structured Streaming like spark.sql.streaming.stateStore.minDeltasForSnapshot. ## How was this patch tested? Existing unit tests. Author: Kousuke Saruta Closes #13777 from sarutak/SPARK-16061. (cherry picked from commit 6daa8cf1a642a669cd3a0305036c4390e4336a73) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b4066354 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b4066354 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b4066354 Branch: refs/heads/branch-2.0 Commit: b4066354141b933cdfdfdf266c6d4ff21338dcdf Parents: 54001cb Author: Kousuke Saruta Authored: Mon Jun 20 15:12:40 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 15:12:45 2016 -0700 -- .../apache/spark/sql/execution/streaming/state/StateStore.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b4066354/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala index 9948292..0667653 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala @@ -115,7 +115,7 @@ case class KeyRemoved(key: UnsafeRow) extends StoreUpdate */ private[sql] object StateStore extends Logging { - val MAINTENANCE_INTERVAL_CONFIG = "spark.streaming.stateStore.maintenanceInterval" + val MAINTENANCE_INTERVAL_CONFIG = "spark.sql.streaming.stateStore.maintenanceInterval" val MAINTENANCE_INTERVAL_DEFAULT_SECS = 60 private val loadedProviders = new mutable.HashMap[StateStoreId, StateStoreProvider]() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
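For users who set this property: it is read as an ordinary Spark configuration time string (default 60s), so after this rename it would be supplied under the new key, e.g. via `--conf` on spark-submit or when building the session. A PySpark sketch under that assumption:

```
from pyspark.sql import SparkSession

# Assumption: the interval is picked up from the SparkConf of the running application.
spark = (SparkSession.builder
         .appName("state-store-demo")
         .config("spark.sql.streaming.stateStore.maintenanceInterval", "30s")
         .getOrCreate())

print(spark.sparkContext.getConf().get(
    "spark.sql.streaming.stateStore.maintenanceInterval"))
```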
spark git commit: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc
Repository: spark Updated Branches: refs/heads/branch-2.0 8159da20e -> 54001cb12 [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc ## What changes were proposed in this pull request? Issues with current reader behavior. - `text()` without args returns an empty DF with no columns -> inconsistent, its expected that text will always return a DF with `value` string field, - `textFile()` without args fails with exception because of the above reason, it expected the DF returned by `text()` to have a `value` field. - `orc()` does not have var args, inconsistent with others - `json(single-arg)` was removed, but that caused source compatibility issues - [SPARK-16009](https://issues.apache.org/jira/browse/SPARK-16009) - user specified schema was not respected when `text/csv/...` were used with no args - [SPARK-16007](https://issues.apache.org/jira/browse/SPARK-16007) The solution I am implementing is to do the following. - For each format, there will be a single argument method, and a vararg method. For json, parquet, csv, text, this means adding json(string), etc.. For orc, this means adding orc(varargs). - Remove the special handling of text(), csv(), etc. that returns empty dataframe with no fields. Rather pass on the empty sequence of paths to the datasource, and let each datasource handle it right. For e.g, text data source, should return empty DF with schema (value: string) - Deduped docs and fixed their formatting. ## How was this patch tested? Added new unit tests for Scala and Java tests Author: Tathagata Das Closes #13727 from tdas/SPARK-15982. (cherry picked from commit b99129cc452defc266f6d357f5baab5f4ff37a36) Signed-off-by: Shixiong Zhu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/54001cb1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/54001cb1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/54001cb1 Branch: refs/heads/branch-2.0 Commit: 54001cb129674be9f2459368fb608367f52371c2 Parents: 8159da2 Author: Tathagata Das Authored: Mon Jun 20 14:52:28 2016 -0700 Committer: Shixiong Zhu Committed: Mon Jun 20 14:52:35 2016 -0700 -- .../org/apache/spark/sql/DataFrameReader.scala | 132 + .../sql/JavaDataFrameReaderWriterSuite.java | 158 .../sql/test/DataFrameReaderWriterSuite.scala | 186 --- 3 files changed, 420 insertions(+), 56 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/54001cb1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index 2ae854d..841503b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -119,13 +119,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def load(): DataFrame = { -val dataSource = - DataSource( -sparkSession, -userSpecifiedSchema = userSpecifiedSchema, -className = source, -options = extraOptions.toMap) -Dataset.ofRows(sparkSession, LogicalRelation(dataSource.resolveRelation())) +load(Seq.empty: _*) // force invocation of `load(...varargs...)` } /** @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def load(path: String): DataFrame = { -option("path", path).load() +load(Seq(path): _*) // force invocation of 
`load(...varargs...)` } /** @@ -146,18 +140,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { */ @scala.annotation.varargs def load(paths: String*): DataFrame = { -if (paths.isEmpty) { - sparkSession.emptyDataFrame -} else { - sparkSession.baseRelationToDataFrame( -DataSource.apply( - sparkSession, - paths = paths, - userSpecifiedSchema = userSpecifiedSchema, - className = source, - options = extraOptions.toMap).resolveRelation()) -} +sparkSession.baseRelationToDataFrame( + DataSource.apply( +sparkSession, +paths = paths, +userSpecifiedSchema = userSpecifiedSchema, +className = source, +options = extraOptions.toMap).resolveRelation()) } + /** * Construct a [[DataFrame]] representing the database table accessible via JDBC URL * url named table and connection properties. @@ -247,11 +238,23 @@ class DataFrameReader private[sql](sparkSession: SparkSession) ex
spark git commit: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc
Repository: spark Updated Branches: refs/heads/master 6df8e3886 -> b99129cc4 [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc ## What changes were proposed in this pull request? Issues with current reader behavior. - `text()` without args returns an empty DF with no columns -> inconsistent, its expected that text will always return a DF with `value` string field, - `textFile()` without args fails with exception because of the above reason, it expected the DF returned by `text()` to have a `value` field. - `orc()` does not have var args, inconsistent with others - `json(single-arg)` was removed, but that caused source compatibility issues - [SPARK-16009](https://issues.apache.org/jira/browse/SPARK-16009) - user specified schema was not respected when `text/csv/...` were used with no args - [SPARK-16007](https://issues.apache.org/jira/browse/SPARK-16007) The solution I am implementing is to do the following. - For each format, there will be a single argument method, and a vararg method. For json, parquet, csv, text, this means adding json(string), etc.. For orc, this means adding orc(varargs). - Remove the special handling of text(), csv(), etc. that returns empty dataframe with no fields. Rather pass on the empty sequence of paths to the datasource, and let each datasource handle it right. For e.g, text data source, should return empty DF with schema (value: string) - Deduped docs and fixed their formatting. ## How was this patch tested? Added new unit tests for Scala and Java tests Author: Tathagata Das Closes #13727 from tdas/SPARK-15982. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b99129cc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b99129cc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b99129cc Branch: refs/heads/master Commit: b99129cc452defc266f6d357f5baab5f4ff37a36 Parents: 6df8e38 Author: Tathagata Das Authored: Mon Jun 20 14:52:28 2016 -0700 Committer: Shixiong Zhu Committed: Mon Jun 20 14:52:28 2016 -0700 -- .../org/apache/spark/sql/DataFrameReader.scala | 132 + .../sql/JavaDataFrameReaderWriterSuite.java | 158 .../sql/test/DataFrameReaderWriterSuite.scala | 186 --- 3 files changed, 420 insertions(+), 56 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b99129cc/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index 2ae854d..841503b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -119,13 +119,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def load(): DataFrame = { -val dataSource = - DataSource( -sparkSession, -userSpecifiedSchema = userSpecifiedSchema, -className = source, -options = extraOptions.toMap) -Dataset.ofRows(sparkSession, LogicalRelation(dataSource.resolveRelation())) +load(Seq.empty: _*) // force invocation of `load(...varargs...)` } /** @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def load(path: String): DataFrame = { -option("path", path).load() +load(Seq(path): _*) // force invocation of `load(...varargs...)` } /** @@ -146,18 +140,15 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging { */ @scala.annotation.varargs def load(paths: String*): DataFrame = { -if (paths.isEmpty) { - sparkSession.emptyDataFrame -} else { - sparkSession.baseRelationToDataFrame( -DataSource.apply( - sparkSession, - paths = paths, - userSpecifiedSchema = userSpecifiedSchema, - className = source, - options = extraOptions.toMap).resolveRelation()) -} +sparkSession.baseRelationToDataFrame( + DataSource.apply( +sparkSession, +paths = paths, +userSpecifiedSchema = userSpecifiedSchema, +className = source, +options = extraOptions.toMap).resolveRelation()) } + /** * Construct a [[DataFrame]] representing the database table accessible via JDBC URL * url named table and connection properties. @@ -247,11 +238,23 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { /** * Loads a JSON file (one object per line) and returns the result as a [[DataF
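The patch itself touches the Scala/Java `DataFrameReader`; for readers working from Python, a sketch of the intended behavior (paths are hypothetical): an explicit schema set through `schema()` is carried through to the source, and `text()` always yields a single string column named `value`.

```
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("reader-demo").getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])

# Hypothetical input path; the user-specified schema is used as-is.
people = spark.read.schema(schema).json("/tmp/people.json")
people.printSchema()

# text() reads each line into a DataFrame with one string column, `value`.
lines = spark.read.text("/tmp/people.json")
lines.printSchema()
```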
spark git commit: [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0
Repository: spark Updated Branches: refs/heads/branch-2.0 54aef1c14 -> 8159da20e [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0 ## What changes were proposed in this pull request? Initial SQL programming guide update for Spark 2.0. Contents like 1.6 to 2.0 migration guide are still incomplete. We may also want to add more examples for Scala/Java Dataset typed transformations. ## How was this patch tested? N/A Author: Cheng Lian Closes #13592 from liancheng/sql-programming-guide-2.0. (cherry picked from commit 6df8e3886063a9d8c2e8499456ea9166245d5640) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8159da20 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8159da20 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8159da20 Branch: refs/heads/branch-2.0 Commit: 8159da20ee9c170324772792f2b242a85cbb7d34 Parents: 54aef1c Author: Cheng Lian Authored: Mon Jun 20 14:50:28 2016 -0700 Committer: Yin Huai Committed: Mon Jun 20 14:50:46 2016 -0700 -- docs/sql-programming-guide.md | 605 +++-- 1 file changed, 317 insertions(+), 288 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8159da20/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index efdf873..d93f30b 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -12,130 +12,129 @@ title: Spark SQL and DataFrames Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result +interact with Spark SQL including SQL and the Dataset API. When computing a result the same execution engine is used, independent of which API/language you are using to express the -computation. This unification means that developers can easily switch back and forth between the -various APIs based on which provides the most natural way to express a given transformation. +computation. This unification means that developers can easily switch back and forth between +different APIs based on which provides the most natural way to express a given transformation. All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell. ## SQL -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. +One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames). +SQL from within another programming language the results will be returned as a [DataFrame](#datasets-and-dataframes). You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli) or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server). -## DataFrames +## Datasets and DataFrames -A DataFrame is a distributed collection of data organized into named columns. 
It is conceptually -equivalent to a table in a relational database or a data frame in R/Python, but with richer -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such -as: structured data files, tables in Hive, external databases, or existing RDDs. +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then +manipulated using functional transformations (`map`, `flatMap`, `filter`, etc.). -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame), -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html), -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html). +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s. +In fact, `DataFrame` is si
spark git commit: [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0
Repository: spark Updated Branches: refs/heads/master d0eddb80e -> 6df8e3886 [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0 ## What changes were proposed in this pull request? Initial SQL programming guide update for Spark 2.0. Contents like 1.6 to 2.0 migration guide are still incomplete. We may also want to add more examples for Scala/Java Dataset typed transformations. ## How was this patch tested? N/A Author: Cheng Lian Closes #13592 from liancheng/sql-programming-guide-2.0. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6df8e388 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6df8e388 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6df8e388 Branch: refs/heads/master Commit: 6df8e3886063a9d8c2e8499456ea9166245d5640 Parents: d0eddb8 Author: Cheng Lian Authored: Mon Jun 20 14:50:28 2016 -0700 Committer: Yin Huai Committed: Mon Jun 20 14:50:28 2016 -0700 -- docs/sql-programming-guide.md | 605 +++-- 1 file changed, 317 insertions(+), 288 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6df8e388/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index efdf873..d93f30b 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -12,130 +12,129 @@ title: Spark SQL and DataFrames Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result +interact with Spark SQL including SQL and the Dataset API. When computing a result the same execution engine is used, independent of which API/language you are using to express the -computation. This unification means that developers can easily switch back and forth between the -various APIs based on which provides the most natural way to express a given transformation. +computation. This unification means that developers can easily switch back and forth between +different APIs based on which provides the most natural way to express a given transformation. All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell. ## SQL -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. +One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames). +SQL from within another programming language the results will be returned as a [DataFrame](#datasets-and-dataframes). You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli) or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server). -## DataFrames +## Datasets and DataFrames -A DataFrame is a distributed collection of data organized into named columns. 
It is conceptually -equivalent to a table in a relational database or a data frame in R/Python, but with richer -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such -as: structured data files, tables in Hive, external databases, or existing RDDs. +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then +manipulated using functional transformations (`map`, `flatMap`, `filter`, etc.). -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame), -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html), -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html). +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s. +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets]. +However, [Java API][java-data
[1/2] spark git commit: [SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods
Repository: spark Updated Branches: refs/heads/branch-2.0 f90b2ea1d -> 54aef1c14 http://git-wip-us.apache.org/repos/asf/spark/blob/54aef1c1/R/pkg/R/mllib.R -- diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R index 2127dae..d6ff2aa 100644 --- a/R/pkg/R/mllib.R +++ b/R/pkg/R/mllib.R @@ -29,24 +29,28 @@ #' #' @param jobj a Java object reference to the backing Scala GeneralizedLinearRegressionWrapper #' @export +#' @note GeneralizedLinearRegressionModel since 2.0.0 setClass("GeneralizedLinearRegressionModel", representation(jobj = "jobj")) #' S4 class that represents a NaiveBayesModel #' #' @param jobj a Java object reference to the backing Scala NaiveBayesWrapper #' @export +#' @note NaiveBayesModel since 2.0.0 setClass("NaiveBayesModel", representation(jobj = "jobj")) #' S4 class that represents a AFTSurvivalRegressionModel #' #' @param jobj a Java object reference to the backing Scala AFTSurvivalRegressionWrapper #' @export +#' @note AFTSurvivalRegressionModel since 2.0.0 setClass("AFTSurvivalRegressionModel", representation(jobj = "jobj")) #' S4 class that represents a KMeansModel #' #' @param jobj a Java object reference to the backing Scala KMeansModel #' @export +#' @note KMeansModel since 2.0.0 setClass("KMeansModel", representation(jobj = "jobj")) #' Fits a generalized linear model @@ -73,6 +77,7 @@ setClass("KMeansModel", representation(jobj = "jobj")) #' model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family="gaussian") #' summary(model) #' } +#' @note spark.glm since 2.0.0 setMethod( "spark.glm", signature(data = "SparkDataFrame", formula = "formula"), @@ -120,6 +125,7 @@ setMethod( #' model <- glm(Sepal_Length ~ Sepal_Width, df, family="gaussian") #' summary(model) #' } +#' @note glm since 1.5.0 setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDataFrame"), function(formula, family = gaussian, data, epsilon = 1e-06, maxit = 25) { spark.glm(data, formula, family, epsilon, maxit) @@ -138,6 +144,7 @@ setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDat #' model <- glm(y ~ x, trainingData) #' summary(model) #' } +#' @note summary(GeneralizedLinearRegressionModel) since 2.0.0 setMethod("summary", signature(object = "GeneralizedLinearRegressionModel"), function(object, ...) { jobj <- object@jobj @@ -173,6 +180,7 @@ setMethod("summary", signature(object = "GeneralizedLinearRegressionModel"), #' @rdname print #' @name print.summary.GeneralizedLinearRegressionModel #' @export +#' @note print.summary.GeneralizedLinearRegressionModel since 2.0.0 print.summary.GeneralizedLinearRegressionModel <- function(x, ...) { if (x$is.loaded) { cat("\nSaved-loaded model does not support output 'Deviance Residuals'.\n") @@ -215,6 +223,7 @@ print.summary.GeneralizedLinearRegressionModel <- function(x, ...) 
{ #' predicted <- predict(model, testData) #' showDF(predicted) #' } +#' @note predict(GeneralizedLinearRegressionModel) since 1.5.0 setMethod("predict", signature(object = "GeneralizedLinearRegressionModel"), function(object, newData) { return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf))) @@ -236,6 +245,7 @@ setMethod("predict", signature(object = "GeneralizedLinearRegressionModel"), #' predicted <- predict(model, testData) #' showDF(predicted) #'} +#' @note predict(NaiveBayesModel) since 2.0.0 setMethod("predict", signature(object = "NaiveBayesModel"), function(object, newData) { return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf))) @@ -256,6 +266,7 @@ setMethod("predict", signature(object = "NaiveBayesModel"), #' model <- spark.naiveBayes(trainingData, y ~ x) #' summary(model) #'} +#' @note summary(NaiveBayesModel) since 2.0.0 setMethod("summary", signature(object = "NaiveBayesModel"), function(object, ...) { jobj <- object@jobj @@ -289,6 +300,7 @@ setMethod("summary", signature(object = "NaiveBayesModel"), #' \dontrun{ #' model <- spark.kmeans(data, ~ ., k=2, initMode="random") #' } +#' @note spark.kmeans since 2.0.0 setMethod("spark.kmeans", signature(data = "SparkDataFrame", formula = "formula"), function(data, formula, k, maxIter = 10, initMode = c("random", "k-means||")) { formula <- paste(deparse(formula), collapse = "") @@ -313,6 +325,7 @@ setMethod("spark.kmeans", signature(data = "SparkDataFrame", formula = "formula" #' fitted.model <- fitted(model) #' showDF(fitted.model) #'} +#' @note fitted since 2.0.0 setMethod("fitted", signature(object = "KMeansModel"), function(object, method = c("centers", "classes"), ...) { method <- match.arg(method) @@ -339,6 +352,7 @@ setMethod("fitted", signature(object = "KMeansModel"), #' model <- spark.kmeans(trainingData, ~ ., 2) #' summary(model) #' } +#'
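Putting the newly tagged mllib entry points together, a short sketch adapted from the roxygen `@examples` in this diff (the iris dataset is just a convenient choice; note SparkR converts dots in column names to underscores):

```r
sparkR.session()
df <- createDataFrame(iris)
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian")  # spark.glm, since 2.0.0
summary(model)                                                           # summary(GLM model), since 2.0.0
predicted <- predict(model, df)                                          # predict(GLM model), since 1.5.0
showDF(predicted)
```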
[1/2] spark git commit: [SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods
Repository: spark Updated Branches: refs/heads/master 92514232e -> d0eddb80e http://git-wip-us.apache.org/repos/asf/spark/blob/d0eddb80/R/pkg/R/mllib.R -- diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R index 2127dae..d6ff2aa 100644 --- a/R/pkg/R/mllib.R +++ b/R/pkg/R/mllib.R @@ -29,24 +29,28 @@ #' #' @param jobj a Java object reference to the backing Scala GeneralizedLinearRegressionWrapper #' @export +#' @note GeneralizedLinearRegressionModel since 2.0.0 setClass("GeneralizedLinearRegressionModel", representation(jobj = "jobj")) #' S4 class that represents a NaiveBayesModel #' #' @param jobj a Java object reference to the backing Scala NaiveBayesWrapper #' @export +#' @note NaiveBayesModel since 2.0.0 setClass("NaiveBayesModel", representation(jobj = "jobj")) #' S4 class that represents a AFTSurvivalRegressionModel #' #' @param jobj a Java object reference to the backing Scala AFTSurvivalRegressionWrapper #' @export +#' @note AFTSurvivalRegressionModel since 2.0.0 setClass("AFTSurvivalRegressionModel", representation(jobj = "jobj")) #' S4 class that represents a KMeansModel #' #' @param jobj a Java object reference to the backing Scala KMeansModel #' @export +#' @note KMeansModel since 2.0.0 setClass("KMeansModel", representation(jobj = "jobj")) #' Fits a generalized linear model @@ -73,6 +77,7 @@ setClass("KMeansModel", representation(jobj = "jobj")) #' model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family="gaussian") #' summary(model) #' } +#' @note spark.glm since 2.0.0 setMethod( "spark.glm", signature(data = "SparkDataFrame", formula = "formula"), @@ -120,6 +125,7 @@ setMethod( #' model <- glm(Sepal_Length ~ Sepal_Width, df, family="gaussian") #' summary(model) #' } +#' @note glm since 1.5.0 setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDataFrame"), function(formula, family = gaussian, data, epsilon = 1e-06, maxit = 25) { spark.glm(data, formula, family, epsilon, maxit) @@ -138,6 +144,7 @@ setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDat #' model <- glm(y ~ x, trainingData) #' summary(model) #' } +#' @note summary(GeneralizedLinearRegressionModel) since 2.0.0 setMethod("summary", signature(object = "GeneralizedLinearRegressionModel"), function(object, ...) { jobj <- object@jobj @@ -173,6 +180,7 @@ setMethod("summary", signature(object = "GeneralizedLinearRegressionModel"), #' @rdname print #' @name print.summary.GeneralizedLinearRegressionModel #' @export +#' @note print.summary.GeneralizedLinearRegressionModel since 2.0.0 print.summary.GeneralizedLinearRegressionModel <- function(x, ...) { if (x$is.loaded) { cat("\nSaved-loaded model does not support output 'Deviance Residuals'.\n") @@ -215,6 +223,7 @@ print.summary.GeneralizedLinearRegressionModel <- function(x, ...) 
{ #' predicted <- predict(model, testData) #' showDF(predicted) #' } +#' @note predict(GeneralizedLinearRegressionModel) since 1.5.0 setMethod("predict", signature(object = "GeneralizedLinearRegressionModel"), function(object, newData) { return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf))) @@ -236,6 +245,7 @@ setMethod("predict", signature(object = "GeneralizedLinearRegressionModel"), #' predicted <- predict(model, testData) #' showDF(predicted) #'} +#' @note predict(NaiveBayesModel) since 2.0.0 setMethod("predict", signature(object = "NaiveBayesModel"), function(object, newData) { return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf))) @@ -256,6 +266,7 @@ setMethod("predict", signature(object = "NaiveBayesModel"), #' model <- spark.naiveBayes(trainingData, y ~ x) #' summary(model) #'} +#' @note summary(NaiveBayesModel) since 2.0.0 setMethod("summary", signature(object = "NaiveBayesModel"), function(object, ...) { jobj <- object@jobj @@ -289,6 +300,7 @@ setMethod("summary", signature(object = "NaiveBayesModel"), #' \dontrun{ #' model <- spark.kmeans(data, ~ ., k=2, initMode="random") #' } +#' @note spark.kmeans since 2.0.0 setMethod("spark.kmeans", signature(data = "SparkDataFrame", formula = "formula"), function(data, formula, k, maxIter = 10, initMode = c("random", "k-means||")) { formula <- paste(deparse(formula), collapse = "") @@ -313,6 +325,7 @@ setMethod("spark.kmeans", signature(data = "SparkDataFrame", formula = "formula" #' fitted.model <- fitted(model) #' showDF(fitted.model) #'} +#' @note fitted since 2.0.0 setMethod("fitted", signature(object = "KMeansModel"), function(object, method = c("centers", "classes"), ...) { method <- match.arg(method) @@ -339,6 +352,7 @@ setMethod("fitted", signature(object = "KMeansModel"), #' model <- spark.kmeans(trainingData, ~ ., 2) #' summary(model) #' } +#' @not
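The k-means entry points tagged in the same file follow the same pattern; a sketch with illustrative parameters (columns and `k` are our choice, not from the patch):

```r
sparkR.session()
df <- createDataFrame(iris)
model <- spark.kmeans(df, ~ Sepal_Length + Sepal_Width, k = 3)  # spark.kmeans, since 2.0.0
summary(model)
centers <- fitted(model, "centers")                             # fitted, since 2.0.0
showDF(centers)
```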
[2/2] spark git commit: [SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods
[SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods ## What changes were proposed in this pull request? This PR adds `since` tags to Roxygen documentation according to the previous documentation archive. https://home.apache.org/~dongjoon/spark-2.0.0-docs/api/R/ ## How was this patch tested? Manual. Author: Dongjoon Hyun Closes #13734 from dongjoon-hyun/SPARK-14995. (cherry picked from commit d0eddb80eca04e4f5f8af3b5143096cf67200277) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/54aef1c1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/54aef1c1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/54aef1c1 Branch: refs/heads/branch-2.0 Commit: 54aef1c1414589b5143ec3cbbf3b1e17648b7067 Parents: f90b2ea Author: Dongjoon Hyun Authored: Mon Jun 20 14:24:41 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 14:24:48 2016 -0700 -- R/pkg/R/DataFrame.R | 93 +++- R/pkg/R/SQLContext.R | 42 ++--- R/pkg/R/WindowSpec.R | 8 +++ R/pkg/R/column.R | 10 +++ R/pkg/R/context.R| 3 +- R/pkg/R/functions.R | 153 ++ R/pkg/R/group.R | 6 ++ R/pkg/R/jobj.R | 1 + R/pkg/R/mllib.R | 24 R/pkg/R/schema.R | 5 +- R/pkg/R/sparkR.R | 18 +++--- R/pkg/R/stats.R | 6 ++ R/pkg/R/utils.R | 1 + R/pkg/R/window.R | 4 ++ 14 files changed, 340 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/54aef1c1/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 583d3ae..ecdcd6e 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -25,7 +25,7 @@ setOldClass("structType") #' S4 class that represents a SparkDataFrame #' -#' DataFrames can be created using functions like \link{createDataFrame}, +#' SparkDataFrames can be created using functions like \link{createDataFrame}, #' \link{read.json}, \link{table} etc. 
#' #' @family SparkDataFrame functions @@ -42,6 +42,7 @@ setOldClass("structType") #' sparkR.session() #' df <- createDataFrame(faithful) #'} +#' @note SparkDataFrame since 2.0.0 setClass("SparkDataFrame", slots = list(env = "environment", sdf = "jobj")) @@ -81,6 +82,7 @@ dataFrame <- function(sdf, isCached = FALSE) { #' df <- read.json(path) #' printSchema(df) #'} +#' @note printSchema since 1.4.0 setMethod("printSchema", signature(x = "SparkDataFrame"), function(x) { @@ -105,6 +107,7 @@ setMethod("printSchema", #' df <- read.json(path) #' dfSchema <- schema(df) #'} +#' @note schema since 1.4.0 setMethod("schema", signature(x = "SparkDataFrame"), function(x) { @@ -128,6 +131,7 @@ setMethod("schema", #' df <- read.json(path) #' explain(df, TRUE) #'} +#' @note explain since 1.4.0 setMethod("explain", signature(x = "SparkDataFrame"), function(x, extended = FALSE) { @@ -158,6 +162,7 @@ setMethod("explain", #' df <- read.json(path) #' isLocal(df) #'} +#' @note isLocal since 1.4.0 setMethod("isLocal", signature(x = "SparkDataFrame"), function(x) { @@ -182,6 +187,7 @@ setMethod("isLocal", #' df <- read.json(path) #' showDF(df) #'} +#' @note showDF since 1.4.0 setMethod("showDF", signature(x = "SparkDataFrame"), function(x, numRows = 20, truncate = TRUE) { @@ -206,6 +212,7 @@ setMethod("showDF", #' df <- read.json(path) #' df #'} +#' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { cols <- lapply(dtypes(object), function(l) { @@ -232,6 +239,7 @@ setMethod("show", "SparkDataFrame", #' df <- read.json(path) #' dtypes(df) #'} +#' @note dtypes since 1.4.0 setMethod("dtypes", signature(x = "SparkDataFrame"), function(x) { @@ -259,6 +267,7 @@ setMethod("dtypes", #' columns(df) #' colnames(df) #'} +#' @note columns since 1.4.0 setMethod("columns", signature(x = "SparkDataFrame"), function(x) { @@ -269,6 +278,7 @@ setMethod("columns", #' @rdname columns #' @name names +#' @note names since 1.5.0 setMethod("names", signature(x = "SparkDataFrame"), function(x) { @@ -277,6 +287,7 @@ setMethod("names", #' @rdname columns #' @name names<- +#' @note names<- since 1.5.0 setMethod("names<-", signature(x = "SparkDataFrame"), function(x, value) { @@ -288,6 +299,7 @@ setMethod("names<-", #' @rdname columns #' @name colnames +#' @note colnames since 1.6.0 setMethod("colnames", signature(x = "SparkDataFrame")
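A quick tour of the schema-inspection methods annotated above (all tagged "since 1.4.0"); a hedged sketch using a locally created SparkDataFrame rather than the JSON path placeholder in the roxygen examples:

```r
sparkR.session()
df <- createDataFrame(faithful)
printSchema(df)    # tree-formatted schema
dtypes(df)         # list of (name, type) pairs
columns(df)        # "eruptions" "waiting"
explain(df, TRUE)  # extended query plan
```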
[2/2] spark git commit: [SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods
[SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods ## What changes were proposed in this pull request? This PR adds `since` tags to Roxygen documentation according to the previous documentation archive. https://home.apache.org/~dongjoon/spark-2.0.0-docs/api/R/ ## How was this patch tested? Manual. Author: Dongjoon Hyun Closes #13734 from dongjoon-hyun/SPARK-14995. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d0eddb80 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d0eddb80 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d0eddb80 Branch: refs/heads/master Commit: d0eddb80eca04e4f5f8af3b5143096cf67200277 Parents: 9251423 Author: Dongjoon Hyun Authored: Mon Jun 20 14:24:41 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 14:24:41 2016 -0700 -- R/pkg/R/DataFrame.R | 93 +++- R/pkg/R/SQLContext.R | 42 ++--- R/pkg/R/WindowSpec.R | 8 +++ R/pkg/R/column.R | 10 +++ R/pkg/R/context.R| 3 +- R/pkg/R/functions.R | 153 ++ R/pkg/R/group.R | 6 ++ R/pkg/R/jobj.R | 1 + R/pkg/R/mllib.R | 24 R/pkg/R/schema.R | 5 +- R/pkg/R/sparkR.R | 18 +++--- R/pkg/R/stats.R | 6 ++ R/pkg/R/utils.R | 1 + R/pkg/R/window.R | 4 ++ 14 files changed, 340 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d0eddb80/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 583d3ae..ecdcd6e 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -25,7 +25,7 @@ setOldClass("structType") #' S4 class that represents a SparkDataFrame #' -#' DataFrames can be created using functions like \link{createDataFrame}, +#' SparkDataFrames can be created using functions like \link{createDataFrame}, #' \link{read.json}, \link{table} etc. 
#' #' @family SparkDataFrame functions @@ -42,6 +42,7 @@ setOldClass("structType") #' sparkR.session() #' df <- createDataFrame(faithful) #'} +#' @note SparkDataFrame since 2.0.0 setClass("SparkDataFrame", slots = list(env = "environment", sdf = "jobj")) @@ -81,6 +82,7 @@ dataFrame <- function(sdf, isCached = FALSE) { #' df <- read.json(path) #' printSchema(df) #'} +#' @note printSchema since 1.4.0 setMethod("printSchema", signature(x = "SparkDataFrame"), function(x) { @@ -105,6 +107,7 @@ setMethod("printSchema", #' df <- read.json(path) #' dfSchema <- schema(df) #'} +#' @note schema since 1.4.0 setMethod("schema", signature(x = "SparkDataFrame"), function(x) { @@ -128,6 +131,7 @@ setMethod("schema", #' df <- read.json(path) #' explain(df, TRUE) #'} +#' @note explain since 1.4.0 setMethod("explain", signature(x = "SparkDataFrame"), function(x, extended = FALSE) { @@ -158,6 +162,7 @@ setMethod("explain", #' df <- read.json(path) #' isLocal(df) #'} +#' @note isLocal since 1.4.0 setMethod("isLocal", signature(x = "SparkDataFrame"), function(x) { @@ -182,6 +187,7 @@ setMethod("isLocal", #' df <- read.json(path) #' showDF(df) #'} +#' @note showDF since 1.4.0 setMethod("showDF", signature(x = "SparkDataFrame"), function(x, numRows = 20, truncate = TRUE) { @@ -206,6 +212,7 @@ setMethod("showDF", #' df <- read.json(path) #' df #'} +#' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { cols <- lapply(dtypes(object), function(l) { @@ -232,6 +239,7 @@ setMethod("show", "SparkDataFrame", #' df <- read.json(path) #' dtypes(df) #'} +#' @note dtypes since 1.4.0 setMethod("dtypes", signature(x = "SparkDataFrame"), function(x) { @@ -259,6 +267,7 @@ setMethod("dtypes", #' columns(df) #' colnames(df) #'} +#' @note columns since 1.4.0 setMethod("columns", signature(x = "SparkDataFrame"), function(x) { @@ -269,6 +278,7 @@ setMethod("columns", #' @rdname columns #' @name names +#' @note names since 1.5.0 setMethod("names", signature(x = "SparkDataFrame"), function(x) { @@ -277,6 +287,7 @@ setMethod("names", #' @rdname columns #' @name names<- +#' @note names<- since 1.5.0 setMethod("names<-", signature(x = "SparkDataFrame"), function(x, value) { @@ -288,6 +299,7 @@ setMethod("names<-", #' @rdname columns #' @name colnames +#' @note colnames since 1.6.0 setMethod("colnames", signature(x = "SparkDataFrame"), function(x) { @@ -296,6 +308,7 @@ setMethod("colnames", #' @rdname columns #' @name colnames<-
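The base-R-style accessors documented under the same `@rdname columns` behave like their data.frame counterparts; a small sketch (dataset is illustrative):

```r
df <- createDataFrame(faithful)
names(df)                           # "eruptions" "waiting"   (names, since 1.5.0)
names(df) <- c("duration", "wait")  # names<-, since 1.5.0
colnames(df)                        # "duration" "wait"        (colnames, since 1.6.0)
```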
spark git commit: [MINOR] Closing stale pull requests.
Repository: spark Updated Branches: refs/heads/master 359c2e827 -> 92514232e [MINOR] Closing stale pull requests. Closes #13114 Closes #10187 Closes #13432 Closes #13550 Author: Sean Owen Closes #13781 from srowen/CloseStalePR. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/92514232 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/92514232 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/92514232 Branch: refs/heads/master Commit: 92514232e52af0f5f0413ed97b9571b1b9daaa90 Parents: 359c2e8 Author: Sean Owen Authored: Mon Jun 20 22:12:55 2016 +0100 Committer: Sean Owen Committed: Mon Jun 20 22:12:55 2016 +0100 -- -- - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates
Repository: spark Updated Branches: refs/heads/branch-2.0 45c41aa33 -> f90b2ea1d [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates ## What changes were proposed in this pull request? roxygen2 doc, programming guide, example updates ## How was this patch tested? manual checks shivaram Author: Felix Cheung Closes #13751 from felixcheung/rsparksessiondoc. (cherry picked from commit 359c2e827d5682249c009e83379a5ee8e5aa4e89) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f90b2ea1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f90b2ea1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f90b2ea1 Branch: refs/heads/branch-2.0 Commit: f90b2ea1d96bba4650b8d1ce37a60c81c89bca96 Parents: 45c41aa Author: Felix Cheung Authored: Mon Jun 20 13:46:24 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 13:46:32 2016 -0700 -- R/pkg/R/DataFrame.R | 169 +-- R/pkg/R/SQLContext.R| 47 +++- R/pkg/R/mllib.R | 6 +- R/pkg/R/schema.R| 24 ++-- R/pkg/R/sparkR.R| 7 +- docs/sparkr.md | 99 examples/src/main/r/data-manipulation.R | 15 +-- examples/src/main/r/dataframe.R | 13 +-- examples/src/main/r/ml.R| 21 ++-- 9 files changed, 162 insertions(+), 239 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f90b2ea1/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index f3a3eff..583d3ae 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -35,12 +35,11 @@ setOldClass("structType") #' @slot env An R environment that stores bookkeeping states of the SparkDataFrame #' @slot sdf A Java object reference to the backing Scala DataFrame #' @seealso \link{createDataFrame}, \link{read.json}, \link{table} -#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes} +#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkdataframe} #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' df <- createDataFrame(faithful) #'} setClass("SparkDataFrame", @@ -77,8 +76,7 @@ dataFrame <- function(sdf, isCached = FALSE) { #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' printSchema(df) @@ -102,8 +100,7 @@ setMethod("printSchema", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' dfSchema <- schema(df) @@ -126,8 +123,7 @@ setMethod("schema", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' explain(df, TRUE) @@ -157,8 +153,7 @@ setMethod("explain", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' isLocal(df) @@ -182,8 +177,7 @@ setMethod("isLocal", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' showDF(df) @@ -207,8 +201,7 @@ setMethod("showDF", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- 
read.json(path) #' df @@ -234,8 +227,7 @@ setMethod("show", "SparkDataFrame", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' dtypes(df) @@ -261,8 +253,7 @@ setMethod("dtypes", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' columns(df) @@ -396,8 +387,7 @@ setMethod("coltypes", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' coltypes(df) <- c("character", "integer") @@ -432,7 +422,7 @@ setMethod("coltypes<-", #' Creates a temporary view using the given name. #' -#' Creates a new temporary view using a SparkDataFrame in the SQLContex
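The before/after entry points this doc pass standardizes on, as a sketch (sample data is our choice):

```r
# old (pre-2.0, now deprecated in the examples):
#   sc <- sparkR.init(); sqlContext <- sparkRSQL.init(sc)
# new, as the updated roxygen examples show:
sparkR.session()
df <- createDataFrame(faithful)
head(df)
```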
spark git commit: [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates
Repository: spark Updated Branches: refs/heads/master b0f2fb5b9 -> 359c2e827 [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates ## What changes were proposed in this pull request? roxygen2 doc, programming guide, example updates ## How was this patch tested? manual checks shivaram Author: Felix Cheung Closes #13751 from felixcheung/rsparksessiondoc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/359c2e82 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/359c2e82 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/359c2e82 Branch: refs/heads/master Commit: 359c2e827d5682249c009e83379a5ee8e5aa4e89 Parents: b0f2fb5 Author: Felix Cheung Authored: Mon Jun 20 13:46:24 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 13:46:24 2016 -0700 -- R/pkg/R/DataFrame.R | 169 +-- R/pkg/R/SQLContext.R| 47 +++- R/pkg/R/mllib.R | 6 +- R/pkg/R/schema.R| 24 ++-- R/pkg/R/sparkR.R| 7 +- docs/sparkr.md | 99 examples/src/main/r/data-manipulation.R | 15 +-- examples/src/main/r/dataframe.R | 13 +-- examples/src/main/r/ml.R| 21 ++-- 9 files changed, 162 insertions(+), 239 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/359c2e82/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index f3a3eff..583d3ae 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -35,12 +35,11 @@ setOldClass("structType") #' @slot env An R environment that stores bookkeeping states of the SparkDataFrame #' @slot sdf A Java object reference to the backing Scala DataFrame #' @seealso \link{createDataFrame}, \link{read.json}, \link{table} -#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes} +#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkdataframe} #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' df <- createDataFrame(faithful) #'} setClass("SparkDataFrame", @@ -77,8 +76,7 @@ dataFrame <- function(sdf, isCached = FALSE) { #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' printSchema(df) @@ -102,8 +100,7 @@ setMethod("printSchema", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' dfSchema <- schema(df) @@ -126,8 +123,7 @@ setMethod("schema", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' explain(df, TRUE) @@ -157,8 +153,7 @@ setMethod("explain", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' isLocal(df) @@ -182,8 +177,7 @@ setMethod("isLocal", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' showDF(df) @@ -207,8 +201,7 @@ setMethod("showDF", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' df @@ -234,8 +227,7 @@ setMethod("show", "SparkDataFrame", #' @export #' @examples #'\dontrun{ 
-#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' dtypes(df) @@ -261,8 +253,7 @@ setMethod("dtypes", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' columns(df) @@ -396,8 +387,7 @@ setMethod("coltypes", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' coltypes(df) <- c("character", "integer") @@ -432,7 +422,7 @@ setMethod("coltypes<-", #' Creates a temporary view using the given name. #' -#' Creates a new temporary view using a SparkDataFrame in the SQLContext. If a +#' Creates a new temporary view using a SparkDataFrame in the Spark Session. If a #' temporary view with
spark git commit: [SPARK-16053][R] Add `spark_partition_id` in SparkR
Repository: spark Updated Branches: refs/heads/branch-2.0 dfa920204 -> 45c41aa33 [SPARK-16053][R] Add `spark_partition_id` in SparkR ## What changes were proposed in this pull request? This PR adds `spark_partition_id` virtual column function in SparkR for API parity. The following is just an example to illustrate a SparkR usage on a partitioned parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`. ```r > collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id( id SPARK_PARTITION_ID() 1 30 2 40 3 81 4 91 5 02 6 13 7 24 8 55 9 66 10 77 ``` ## How was this patch tested? Pass the Jenkins tests (including new testcase). Author: Dongjoon Hyun Closes #13768 from dongjoon-hyun/SPARK-16053. (cherry picked from commit b0f2fb5b9729b38744bf784f2072f5ee52314f87) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/45c41aa3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/45c41aa3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/45c41aa3 Branch: refs/heads/branch-2.0 Commit: 45c41aa33b39bfc38b8615fde044356a590edcfb Parents: dfa9202 Author: Dongjoon Hyun Authored: Mon Jun 20 13:41:03 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 13:41:11 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 21 + R/pkg/R/generics.R| 4 R/pkg/inst/tests/testthat/test_sparkSQL.R | 1 + 4 files changed, 27 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/45c41aa3/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index aaeab66..45663f4 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -260,6 +260,7 @@ exportMethods("%in%", "skewness", "sort_array", "soundex", + "spark_partition_id", "stddev", "stddev_pop", "stddev_samp", http://git-wip-us.apache.org/repos/asf/spark/blob/45c41aa3/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 0fb38bc..c26f963 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -1206,6 +1206,27 @@ setMethod("soundex", column(jc) }) +#' Return the partition ID as a column +#' +#' Return the partition ID of the Spark task as a SparkDataFrame column. +#' Note that this is nondeterministic because it depends on data partitioning and +#' task scheduling. +#' +#' This is equivalent to the SPARK_PARTITION_ID function in SQL. 
+#' +#' @rdname spark_partition_id +#' @name spark_partition_id +#' @export +#' @examples +#' \dontrun{select(df, spark_partition_id())} +#' @note spark_partition_id since 2.0.0 +setMethod("spark_partition_id", + signature(x = "missing"), + function() { +jc <- callJStatic("org.apache.spark.sql.functions", "spark_partition_id") +column(jc) + }) + #' @rdname sd #' @name stddev setMethod("stddev", http://git-wip-us.apache.org/repos/asf/spark/blob/45c41aa3/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index dcc1cf2..f6b9276 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -1135,6 +1135,10 @@ setGeneric("sort_array", function(x, asc = TRUE) { standardGeneric("sort_array") #' @export setGeneric("soundex", function(x) { standardGeneric("soundex") }) +#' @rdname spark_partition_id +#' @export +setGeneric("spark_partition_id", function(x) { standardGeneric("spark_partition_id") }) + #' @rdname sd #' @export setGeneric("stddev", function(x) { standardGeneric("stddev") }) http://git-wip-us.apache.org/repos/asf/spark/blob/45c41aa3/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index 114fec6..d53c40d 100644 --- a/R/pkg/inst/tests/testthat/test_sparkSQL.R +++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R @@ -1059,6 +1059,7 @@ test_that("column functions", { c16 <- is.nan(c) + isnan(c) + isNaN(c) c17 <- cov(c, c1) + cov("c", "c1") + covar_samp(c, c1) + covar_samp("c", "c1") c18 <- covar_pop(c, c1) + covar_pop("c", "c1") + c19 <- spark_partition_id() # Test if base::is.nan()
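One way to inspect how rows are spread across partitions from R, loosely following the PR description's example; the alias name and partition count are arbitrary, and the result is nondeterministic across runs, as the doc notes:

```r
df <- repartition(createDataFrame(faithful), 3L)
pids <- collect(select(df, alias(spark_partition_id(), "pid")))
table(pids$pid)    # row count per partition id
```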
spark git commit: [SPARK-16053][R] Add `spark_partition_id` in SparkR
Repository: spark Updated Branches: refs/heads/master aee1420ec -> b0f2fb5b9 [SPARK-16053][R] Add `spark_partition_id` in SparkR ## What changes were proposed in this pull request? This PR adds `spark_partition_id` virtual column function in SparkR for API parity. The following is just an example to illustrate a SparkR usage on a partitioned parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`. ```r > collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id( id SPARK_PARTITION_ID() 1 30 2 40 3 81 4 91 5 02 6 13 7 24 8 55 9 66 10 77 ``` ## How was this patch tested? Pass the Jenkins tests (including new testcase). Author: Dongjoon Hyun Closes #13768 from dongjoon-hyun/SPARK-16053. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b0f2fb5b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b0f2fb5b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b0f2fb5b Branch: refs/heads/master Commit: b0f2fb5b9729b38744bf784f2072f5ee52314f87 Parents: aee1420 Author: Dongjoon Hyun Authored: Mon Jun 20 13:41:03 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 13:41:03 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 21 + R/pkg/R/generics.R| 4 R/pkg/inst/tests/testthat/test_sparkSQL.R | 1 + 4 files changed, 27 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b0f2fb5b/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index aaeab66..45663f4 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -260,6 +260,7 @@ exportMethods("%in%", "skewness", "sort_array", "soundex", + "spark_partition_id", "stddev", "stddev_pop", "stddev_samp", http://git-wip-us.apache.org/repos/asf/spark/blob/b0f2fb5b/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 0fb38bc..c26f963 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -1206,6 +1206,27 @@ setMethod("soundex", column(jc) }) +#' Return the partition ID as a column +#' +#' Return the partition ID of the Spark task as a SparkDataFrame column. +#' Note that this is nondeterministic because it depends on data partitioning and +#' task scheduling. +#' +#' This is equivalent to the SPARK_PARTITION_ID function in SQL. 
+#' +#' @rdname spark_partition_id +#' @name spark_partition_id +#' @export +#' @examples +#' \dontrun{select(df, spark_partition_id())} +#' @note spark_partition_id since 2.0.0 +setMethod("spark_partition_id", + signature(x = "missing"), + function() { +jc <- callJStatic("org.apache.spark.sql.functions", "spark_partition_id") +column(jc) + }) + #' @rdname sd #' @name stddev setMethod("stddev", http://git-wip-us.apache.org/repos/asf/spark/blob/b0f2fb5b/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index dcc1cf2..f6b9276 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -1135,6 +1135,10 @@ setGeneric("sort_array", function(x, asc = TRUE) { standardGeneric("sort_array") #' @export setGeneric("soundex", function(x) { standardGeneric("soundex") }) +#' @rdname spark_partition_id +#' @export +setGeneric("spark_partition_id", function(x) { standardGeneric("spark_partition_id") }) + #' @rdname sd #' @export setGeneric("stddev", function(x) { standardGeneric("stddev") }) http://git-wip-us.apache.org/repos/asf/spark/blob/b0f2fb5b/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index 114fec6..d53c40d 100644 --- a/R/pkg/inst/tests/testthat/test_sparkSQL.R +++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R @@ -1059,6 +1059,7 @@ test_that("column functions", { c16 <- is.nan(c) + isnan(c) + isNaN(c) c17 <- cov(c, c1) + cov("c", "c1") + covar_samp(c, c1) + covar_samp("c", "c1") c18 <- covar_pop(c, c1) + covar_pop("c", "c1") + c19 <- spark_partition_id() # Test if base::is.nan() is exposed expect_equal(is.nan(c("a", "b")), c(FALSE, FALSE))
spark git commit: [SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time
Repository: spark Updated Branches: refs/heads/branch-1.6 16b7f1dfc -> db86e7fd2 [SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time Internally, we use Int to represent a date (the days since 1970-01-01), when we convert that into unix timestamp (milli-seconds since epoch in UTC), we get the offset of a timezone using local millis (the milli-seconds since 1970-01-01 in a timezone), but TimeZone.getOffset() expect unix timestamp, the result could be off by one hour (in Daylight Saving Time (DST) or not). This PR change to use best effort approximate of posix timestamp to lookup the offset. In the event of changing of DST, Some time is not defined (for example, 2016-03-13 02:00:00 PST), or could lead to multiple valid result in UTC (for example, 2016-11-06 01:00:00), this best effort approximate should be enough in practice. Added regression tests. Author: Davies Liu Closes #13652 from davies/fix_timezone. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/db86e7fd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/db86e7fd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/db86e7fd Branch: refs/heads/branch-1.6 Commit: db86e7fd263ca4e24cf8faad95fca3189bab2fb0 Parents: 16b7f1d Author: Davies Liu Authored: Sun Jun 19 00:34:52 2016 -0700 Committer: Davies Liu Committed: Tue Jun 21 04:38:16 2016 +0800 -- .../spark/sql/catalyst/util/DateTimeUtils.scala | 50 ++-- .../org/apache/spark/sql/types/DateType.scala | 2 +- .../sql/catalyst/util/DateTimeTestUtils.scala | 40 .../sql/catalyst/util/DateTimeUtilsSuite.scala | 40 4 files changed, 128 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/db86e7fd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala index 2b93882..157ac2b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala @@ -89,8 +89,8 @@ object DateTimeUtils { // reverse of millisToDays def daysToMillis(days: SQLDate): Long = { -val millisUtc = days.toLong * MILLIS_PER_DAY -millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc) +val millisLocal = days.toLong * MILLIS_PER_DAY +millisLocal - getOffsetFromLocalMillis(millisLocal, threadLocalLocalTimeZone.get()) } def dateToString(days: SQLDate): String = @@ -820,6 +820,41 @@ object DateTimeUtils { } /** + * Lookup the offset for given millis seconds since 1970-01-01 00:00:00 in given timezone. 
+ */ + private def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): Long = { +var guess = tz.getRawOffset +// the actual offset should be calculated based on milliseconds in UTC +val offset = tz.getOffset(millisLocal - guess) +if (offset != guess) { + guess = tz.getOffset(millisLocal - offset) + if (guess != offset) { +// fallback to do the reverse lookup using java.sql.Timestamp +// this should only happen near the start or end of DST +val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt +val year = getYear(days) +val month = getMonth(days) +val day = getDayOfMonth(days) + +var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt +if (millisOfDay < 0) { + millisOfDay += MILLIS_PER_DAY.toInt +} +val seconds = (millisOfDay / 1000L).toInt +val hh = seconds / 3600 +val mm = seconds / 60 % 60 +val ss = seconds % 60 +val nano = millisOfDay % 1000 * 100 + +// create a Timestamp to get the unix timestamp (in UTC) +val timestamp = new Timestamp(year - 1900, month - 1, day, hh, mm, ss, nano) +guess = (millisLocal - timestamp.getTime).toInt + } +} +guess + } + + /** * Returns a timestamp of given timezone from utc timestamp, with the same string * representation in their timezone. */ @@ -835,7 +870,16 @@ object DateTimeUtils { */ def toUTCTime(time: SQLTimestamp, timeZone: String): SQLTimestamp = { val tz = TimeZone.getTimeZone(timeZone) -val offset = tz.getOffset(time / 1000L) +val offset = getOffsetFromLocalMillis(time / 1000L, tz) time - offset * 1000L } + + /** + * Re-initialize the current thread's thread locals. Exposed for testing. + */ + private[util] def resetThreadLocals(): Unit = { +th
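The fix, in words: treat the local millis only as an approximation of the UTC instant, look up the offset there, and re-check once if the two disagree. A loose R illustration of that idea (not the Scala patch, and without the java.sql.Timestamp fallback for the ambiguous DST window; `$gmtoff` availability is platform-dependent):

```r
# offset_at(): the zone's UTC offset, in seconds, at a given UTC instant.
offset_at <- function(unix_secs, tz) {
  as.POSIXlt(as.POSIXct(unix_secs, origin = "1970-01-01", tz = "UTC"), tz = tz)$gmtoff
}

# offset_from_local_secs(): offset for "local seconds since epoch", using the
# fixed-point style lookup described in the commit message.
offset_from_local_secs <- function(local_secs, tz) {
  guess <- offset_at(local_secs, tz)            # first approximation
  offset <- offset_at(local_secs - guess, tz)   # re-evaluate at the approximated UTC instant
  if (offset != guess) offset_at(local_secs - offset, tz) else offset
}

local_secs <- as.integer(as.Date("2016-03-13") - as.Date("1970-01-01")) * 86400
offset_from_local_secs(local_secs, "America/Los_Angeles") / 3600   # -8 (PST at local midnight)
```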
spark git commit: [SPARKR] fix R roxygen2 doc for count on GroupedData
Repository: spark Updated Branches: refs/heads/branch-2.0 d2c94e6a4 -> dfa920204 [SPARKR] fix R roxygen2 doc for count on GroupedData ## What changes were proposed in this pull request? fix code doc ## How was this patch tested? manual shivaram Author: Felix Cheung Closes #13782 from felixcheung/rcountdoc. (cherry picked from commit aee1420eca64dfc145f31b8c653388fafc5ccd8f) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dfa92020 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dfa92020 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dfa92020 Branch: refs/heads/branch-2.0 Commit: dfa920204e3407c38df9012ca42b7b56c416a5b3 Parents: d2c94e6 Author: Felix Cheung Authored: Mon Jun 20 12:31:00 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 12:31:08 2016 -0700 -- R/pkg/R/group.R | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dfa92020/R/pkg/R/group.R -- diff --git a/R/pkg/R/group.R b/R/pkg/R/group.R index eba083f..65b9e84 100644 --- a/R/pkg/R/group.R +++ b/R/pkg/R/group.R @@ -58,7 +58,7 @@ setMethod("show", "GroupedData", #' #' @param x a GroupedData #' @return a SparkDataFrame -#' @rdname agg +#' @rdname count #' @export #' @examples #' \dontrun{ @@ -83,6 +83,7 @@ setMethod("count", #' @rdname summarize #' @name agg #' @family agg_funcs +#' @export #' @examples #' \dontrun{ #' df2 <- agg(df, age = "sum") # new column name will be created as 'SUM(age#0)' @@ -160,6 +161,7 @@ createMethods() #' @return a SparkDataFrame #' @rdname gapply #' @name gapply +#' @export #' @examples #' \dontrun{ #' Computes the arithmetic mean of the second column by grouping - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
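What the regrouped docs describe, in use (a hedged sketch; dataset and column are illustrative):

```r
df <- createDataFrame(faithful)
head(count(groupBy(df, "waiting")))                     # count on GroupedData, now under ?count
head(agg(groupBy(df, "waiting"), eruptions = "mean"))   # agg stays documented under ?summarize
```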
spark git commit: [SPARKR] fix R roxygen2 doc for count on GroupedData
Repository: spark Updated Branches: refs/heads/master 46d98e0a1 -> aee1420ec [SPARKR] fix R roxygen2 doc for count on GroupedData ## What changes were proposed in this pull request? fix code doc ## How was this patch tested? manual shivaram Author: Felix Cheung Closes #13782 from felixcheung/rcountdoc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aee1420e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aee1420e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aee1420e Branch: refs/heads/master Commit: aee1420eca64dfc145f31b8c653388fafc5ccd8f Parents: 46d98e0 Author: Felix Cheung Authored: Mon Jun 20 12:31:00 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 12:31:00 2016 -0700 -- R/pkg/R/group.R | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/aee1420e/R/pkg/R/group.R -- diff --git a/R/pkg/R/group.R b/R/pkg/R/group.R index eba083f..65b9e84 100644 --- a/R/pkg/R/group.R +++ b/R/pkg/R/group.R @@ -58,7 +58,7 @@ setMethod("show", "GroupedData", #' #' @param x a GroupedData #' @return a SparkDataFrame -#' @rdname agg +#' @rdname count #' @export #' @examples #' \dontrun{ @@ -83,6 +83,7 @@ setMethod("count", #' @rdname summarize #' @name agg #' @family agg_funcs +#' @export #' @examples #' \dontrun{ #' df2 <- agg(df, age = "sum") # new column name will be created as 'SUM(age#0)' @@ -160,6 +161,7 @@ createMethods() #' @return a SparkDataFrame #' @rdname gapply #' @name gapply +#' @export #' @examples #' \dontrun{ #' Computes the arithmetic mean of the second column by grouping - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16028][SPARKR] spark.lapply can work with active context
Repository: spark Updated Branches: refs/heads/branch-2.0 ead872e49 -> d2c94e6a4 [SPARK-16028][SPARKR] spark.lapply can work with active context ## What changes were proposed in this pull request? spark.lapply and setLogLevel ## How was this patch tested? unit test shivaram thunterdb Author: Felix Cheung Closes #13752 from felixcheung/rlapply. (cherry picked from commit 46d98e0a1f40a4c6ae92253c5c498a3a924497fc) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d2c94e6a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d2c94e6a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d2c94e6a Branch: refs/heads/branch-2.0 Commit: d2c94e6a45090cf545fe1e243f3dfde5ed87b4d0 Parents: ead872e Author: Felix Cheung Authored: Mon Jun 20 12:08:42 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 12:08:49 2016 -0700 -- R/pkg/R/context.R| 20 +--- R/pkg/inst/tests/testthat/test_context.R | 6 +++--- 2 files changed, 16 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d2c94e6a/R/pkg/R/context.R -- diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index 5c88603..968a9d2 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -252,17 +252,20 @@ setCheckpointDir <- function(sc, dirName) { #' } #' #' @rdname spark.lapply -#' @param sc Spark Context to use #' @param list the list of elements #' @param func a function that takes one argument. #' @return a list of results (the exact type being determined by the function) #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' doubled <- spark.lapply(sc, 1:10, function(x){2 * x}) +#' sparkR.session() +#' doubled <- spark.lapply(1:10, function(x){2 * x}) #'} -spark.lapply <- function(sc, list, func) { +spark.lapply <- function(list, func) { + if (!exists(".sparkRjsc", envir = .sparkREnv)) { +stop("SparkR has not been initialized. Please call sparkR.session()") + } + sc <- get(".sparkRjsc", envir = .sparkREnv) rdd <- parallelize(sc, list, length(list)) results <- map(rdd, func) local <- collect(results) @@ -274,14 +277,17 @@ spark.lapply <- function(sc, list, func) { #' Set new log level: "ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN" #' #' @rdname setLogLevel -#' @param sc Spark Context to use #' @param level New log level #' @export #' @examples #'\dontrun{ -#' setLogLevel(sc, "ERROR") +#' setLogLevel("ERROR") #'} -setLogLevel <- function(sc, level) { +setLogLevel <- function(level) { + if (!exists(".sparkRjsc", envir = .sparkREnv)) { +stop("SparkR has not been initialized. 
Please call sparkR.session()") + } + sc <- get(".sparkRjsc", envir = .sparkREnv) callJMethod(sc, "setLogLevel", level) } http://git-wip-us.apache.org/repos/asf/spark/blob/d2c94e6a/R/pkg/inst/tests/testthat/test_context.R -- diff --git a/R/pkg/inst/tests/testthat/test_context.R b/R/pkg/inst/tests/testthat/test_context.R index f123187..b149818 100644 --- a/R/pkg/inst/tests/testthat/test_context.R +++ b/R/pkg/inst/tests/testthat/test_context.R @@ -107,8 +107,8 @@ test_that("job group functions can be called", { }) test_that("utility function can be called", { - sc <- sparkR.sparkContext() - setLogLevel(sc, "ERROR") + sparkR.sparkContext() + setLogLevel("ERROR") sparkR.session.stop() }) @@ -161,7 +161,7 @@ test_that("sparkJars sparkPackages as comma-separated strings", { test_that("spark.lapply should perform simple transforms", { sc <- sparkR.sparkContext() - doubled <- spark.lapply(sc, 1:10, function(x) { 2 * x }) + doubled <- spark.lapply(1:10, function(x) { 2 * x }) expect_equal(doubled, as.list(2 * 1:10)) sparkR.session.stop() }) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
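The updated calling convention shown in the diff, as a sketch: no SparkContext handle is passed, but an active session is required first.

```r
sparkR.session()
setLogLevel("ERROR")                                   # no longer takes the SparkContext
doubled <- spark.lapply(1:10, function(x) { 2 * x })   # list(2, 4, ..., 20)
```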
spark git commit: [SPARK-16028][SPARKR] spark.lapply can work with active context
Repository: spark Updated Branches: refs/heads/master c44bf137c -> 46d98e0a1 [SPARK-16028][SPARKR] spark.lapply can work with active context ## What changes were proposed in this pull request? spark.lapply and setLogLevel ## How was this patch tested? unit test shivaram thunterdb Author: Felix Cheung Closes #13752 from felixcheung/rlapply. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/46d98e0a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/46d98e0a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/46d98e0a Branch: refs/heads/master Commit: 46d98e0a1f40a4c6ae92253c5c498a3a924497fc Parents: c44bf13 Author: Felix Cheung Authored: Mon Jun 20 12:08:42 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 12:08:42 2016 -0700 -- R/pkg/R/context.R| 20 +--- R/pkg/inst/tests/testthat/test_context.R | 6 +++--- 2 files changed, 16 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/46d98e0a/R/pkg/R/context.R -- diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index 5c88603..968a9d2 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -252,17 +252,20 @@ setCheckpointDir <- function(sc, dirName) { #' } #' #' @rdname spark.lapply -#' @param sc Spark Context to use #' @param list the list of elements #' @param func a function that takes one argument. #' @return a list of results (the exact type being determined by the function) #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' doubled <- spark.lapply(sc, 1:10, function(x){2 * x}) +#' sparkR.session() +#' doubled <- spark.lapply(1:10, function(x){2 * x}) #'} -spark.lapply <- function(sc, list, func) { +spark.lapply <- function(list, func) { + if (!exists(".sparkRjsc", envir = .sparkREnv)) { +stop("SparkR has not been initialized. Please call sparkR.session()") + } + sc <- get(".sparkRjsc", envir = .sparkREnv) rdd <- parallelize(sc, list, length(list)) results <- map(rdd, func) local <- collect(results) @@ -274,14 +277,17 @@ spark.lapply <- function(sc, list, func) { #' Set new log level: "ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN" #' #' @rdname setLogLevel -#' @param sc Spark Context to use #' @param level New log level #' @export #' @examples #'\dontrun{ -#' setLogLevel(sc, "ERROR") +#' setLogLevel("ERROR") #'} -setLogLevel <- function(sc, level) { +setLogLevel <- function(level) { + if (!exists(".sparkRjsc", envir = .sparkREnv)) { +stop("SparkR has not been initialized. 
Please call sparkR.session()") + } + sc <- get(".sparkRjsc", envir = .sparkREnv) callJMethod(sc, "setLogLevel", level) } http://git-wip-us.apache.org/repos/asf/spark/blob/46d98e0a/R/pkg/inst/tests/testthat/test_context.R -- diff --git a/R/pkg/inst/tests/testthat/test_context.R b/R/pkg/inst/tests/testthat/test_context.R index f123187..b149818 100644 --- a/R/pkg/inst/tests/testthat/test_context.R +++ b/R/pkg/inst/tests/testthat/test_context.R @@ -107,8 +107,8 @@ test_that("job group functions can be called", { }) test_that("utility function can be called", { - sc <- sparkR.sparkContext() - setLogLevel(sc, "ERROR") + sparkR.sparkContext() + setLogLevel("ERROR") sparkR.session.stop() }) @@ -161,7 +161,7 @@ test_that("sparkJars sparkPackages as comma-separated strings", { test_that("spark.lapply should perform simple transforms", { sc <- sparkR.sparkContext() - doubled <- spark.lapply(sc, 1:10, function(x) { 2 * x }) + doubled <- spark.lapply(1:10, function(x) { 2 * x }) expect_equal(doubled, as.list(2 * 1:10)) sparkR.session.stop() }) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16051][R] Add `read.orc/write.orc` to SparkR
Repository: spark Updated Branches: refs/heads/master 36e812d4b -> c44bf137c [SPARK-16051][R] Add `read.orc/write.orc` to SparkR ## What changes were proposed in this pull request? This issue adds `read.orc/write.orc` to SparkR for API parity. ## How was this patch tested? Pass the Jenkins tests (with new testcases). Author: Dongjoon Hyun Closes #13763 from dongjoon-hyun/SPARK-16051. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c44bf137 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c44bf137 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c44bf137 Branch: refs/heads/master Commit: c44bf137c7ca649e0c504229eb3e6ff7955e9a53 Parents: 36e812d Author: Dongjoon Hyun Authored: Mon Jun 20 11:30:26 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 11:30:26 2016 -0700 -- R/pkg/NAMESPACE | 2 ++ R/pkg/R/DataFrame.R | 27 ++ R/pkg/R/SQLContext.R | 21 +++- R/pkg/R/generics.R| 4 R/pkg/inst/tests/testthat/test_sparkSQL.R | 21 5 files changed, 74 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c44bf137/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index cc129a7..aaeab66 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -117,6 +117,7 @@ exportMethods("arrange", "write.df", "write.jdbc", "write.json", + "write.orc", "write.parquet", "write.text", "write.ml") @@ -306,6 +307,7 @@ export("as.DataFrame", "read.df", "read.jdbc", "read.json", + "read.orc", "read.parquet", "read.text", "spark.lapply", http://git-wip-us.apache.org/repos/asf/spark/blob/c44bf137/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index ea091c8..f3a3eff 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -701,6 +701,33 @@ setMethod("write.json", invisible(callJMethod(write, "json", path)) }) +#' Save the contents of SparkDataFrame as an ORC file, preserving the schema. +#' +#' Save the contents of a SparkDataFrame as an ORC file, preserving the schema. Files written out +#' with this method can be read back in as a SparkDataFrame using read.orc(). +#' +#' @param x A SparkDataFrame +#' @param path The directory where the file is saved +#' +#' @family SparkDataFrame functions +#' @rdname write.orc +#' @name write.orc +#' @export +#' @examples +#'\dontrun{ +#' sparkR.session() +#' path <- "path/to/file.json" +#' df <- read.json(path) +#' write.orc(df, "/tmp/sparkr-tmp1/") +#' } +#' @note write.orc since 2.0.0 +setMethod("write.orc", + signature(x = "SparkDataFrame", path = "character"), + function(x, path) { +write <- callJMethod(x@sdf, "write") +invisible(callJMethod(write, "orc", path)) + }) + #' Save the contents of SparkDataFrame as a Parquet file, preserving the schema. #' #' Save the contents of a SparkDataFrame as a Parquet file, preserving the schema. Files written out http://git-wip-us.apache.org/repos/asf/spark/blob/c44bf137/R/pkg/R/SQLContext.R -- diff --git a/R/pkg/R/SQLContext.R b/R/pkg/R/SQLContext.R index b0ccc42..b7e1c06 100644 --- a/R/pkg/R/SQLContext.R +++ b/R/pkg/R/SQLContext.R @@ -330,6 +330,25 @@ jsonRDD <- function(sqlContext, rdd, schema = NULL, samplingRatio = 1.0) { } } +#' Create a SparkDataFrame from an ORC file. +#' +#' Loads an ORC file, returning the result as a SparkDataFrame. +#' +#' @param path Path of file to read. 
+#' @return SparkDataFrame +#' @rdname read.orc +#' @export +#' @name read.orc +#' @note read.orc since 2.0.0 +read.orc <- function(path) { + sparkSession <- getSparkSession() + # Allow the user to have a more flexible definiton of the ORC file path + path <- suppressWarnings(normalizePath(path)) + read <- callJMethod(sparkSession, "read") + sdf <- callJMethod(read, "orc", path) + dataFrame(sdf) +} + #' Create a SparkDataFrame from a Parquet file. #' #' Loads a Parquet file, returning the result as a SparkDataFrame. @@ -343,7 +362,7 @@ jsonRDD <- function(sqlContext, rdd, schema = NULL, samplingRatio = 1.0) { read.parquet.default <- function(path) { sparkSession <- getSparkSession() - # Allow the user to have a more flexible definiton of the text file path + # Allow the user to have a more flexible definiton of the Parquet file path pa
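Together, the new reader and writer give SparkR an ORC round trip alongside JSON and Parquet. A hedged sketch of typical usage (the output path is a placeholder; the JSON sample file ships with the Spark distribution):

```r
sparkR.session()

people <- read.json("examples/src/main/resources/people.json")
write.orc(people, "/tmp/people-orc")      # writes a directory of ORC part files
people2 <- read.orc("/tmp/people-orc")    # loads that directory back as a SparkDataFrame

count(people2) == count(people)           # should be TRUE
```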
spark git commit: [SPARK-16051][R] Add `read.orc/write.orc` to SparkR
Repository: spark Updated Branches: refs/heads/branch-2.0 5b22e34e9 -> ead872e49 [SPARK-16051][R] Add `read.orc/write.orc` to SparkR ## What changes were proposed in this pull request? This issue adds `read.orc/write.orc` to SparkR for API parity. ## How was this patch tested? Pass the Jenkins tests (with new testcases). Author: Dongjoon Hyun Closes #13763 from dongjoon-hyun/SPARK-16051. (cherry picked from commit c44bf137c7ca649e0c504229eb3e6ff7955e9a53) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ead872e4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ead872e4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ead872e4 Branch: refs/heads/branch-2.0 Commit: ead872e4996ad0c0b02debd1ab829ff67b79abfb Parents: 5b22e34 Author: Dongjoon Hyun Authored: Mon Jun 20 11:30:26 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 11:30:36 2016 -0700 -- R/pkg/NAMESPACE | 2 ++ R/pkg/R/DataFrame.R | 27 ++ R/pkg/R/SQLContext.R | 21 +++- R/pkg/R/generics.R| 4 R/pkg/inst/tests/testthat/test_sparkSQL.R | 21 5 files changed, 74 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ead872e4/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index cc129a7..aaeab66 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -117,6 +117,7 @@ exportMethods("arrange", "write.df", "write.jdbc", "write.json", + "write.orc", "write.parquet", "write.text", "write.ml") @@ -306,6 +307,7 @@ export("as.DataFrame", "read.df", "read.jdbc", "read.json", + "read.orc", "read.parquet", "read.text", "spark.lapply", http://git-wip-us.apache.org/repos/asf/spark/blob/ead872e4/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index ea091c8..f3a3eff 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -701,6 +701,33 @@ setMethod("write.json", invisible(callJMethod(write, "json", path)) }) +#' Save the contents of SparkDataFrame as an ORC file, preserving the schema. +#' +#' Save the contents of a SparkDataFrame as an ORC file, preserving the schema. Files written out +#' with this method can be read back in as a SparkDataFrame using read.orc(). +#' +#' @param x A SparkDataFrame +#' @param path The directory where the file is saved +#' +#' @family SparkDataFrame functions +#' @rdname write.orc +#' @name write.orc +#' @export +#' @examples +#'\dontrun{ +#' sparkR.session() +#' path <- "path/to/file.json" +#' df <- read.json(path) +#' write.orc(df, "/tmp/sparkr-tmp1/") +#' } +#' @note write.orc since 2.0.0 +setMethod("write.orc", + signature(x = "SparkDataFrame", path = "character"), + function(x, path) { +write <- callJMethod(x@sdf, "write") +invisible(callJMethod(write, "orc", path)) + }) + #' Save the contents of SparkDataFrame as a Parquet file, preserving the schema. #' #' Save the contents of a SparkDataFrame as a Parquet file, preserving the schema. Files written out http://git-wip-us.apache.org/repos/asf/spark/blob/ead872e4/R/pkg/R/SQLContext.R -- diff --git a/R/pkg/R/SQLContext.R b/R/pkg/R/SQLContext.R index b0ccc42..b7e1c06 100644 --- a/R/pkg/R/SQLContext.R +++ b/R/pkg/R/SQLContext.R @@ -330,6 +330,25 @@ jsonRDD <- function(sqlContext, rdd, schema = NULL, samplingRatio = 1.0) { } } +#' Create a SparkDataFrame from an ORC file. +#' +#' Loads an ORC file, returning the result as a SparkDataFrame. +#' +#' @param path Path of file to read. 
+#' @return SparkDataFrame +#' @rdname read.orc +#' @export +#' @name read.orc +#' @note read.orc since 2.0.0 +read.orc <- function(path) { + sparkSession <- getSparkSession() + # Allow the user to have a more flexible definiton of the ORC file path + path <- suppressWarnings(normalizePath(path)) + read <- callJMethod(sparkSession, "read") + sdf <- callJMethod(read, "orc", path) + dataFrame(sdf) +} + #' Create a SparkDataFrame from a Parquet file. #' #' Loads a Parquet file, returning the result as a SparkDataFrame. @@ -343,7 +362,7 @@ jsonRDD <- function(sqlContext, rdd, schema = NULL, samplingRatio = 1.0) { read.parquet.default <- function(path) { sparkSession <- getSparkSession() - # Allow the user to have a more flexible
spark git commit: [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable
Repository: spark Updated Branches: refs/heads/branch-2.0 bb80d1c24 -> 5b22e34e9 [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable ## What changes were proposed in this pull request? Add dropTempView and deprecate dropTempTable ## How was this patch tested? unit tests shivaram liancheng Author: Felix Cheung Closes #13753 from felixcheung/rdroptempview. (cherry picked from commit 36e812d4b695566437c6bac991ef06a0f81fb1c5) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5b22e34e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5b22e34e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5b22e34e Branch: refs/heads/branch-2.0 Commit: 5b22e34e96f7795a0e8d547eba2229b60f999fa5 Parents: bb80d1c Author: Felix Cheung Authored: Mon Jun 20 11:24:41 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 11:24:48 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/SQLContext.R | 39 ++ R/pkg/inst/tests/testthat/test_sparkSQL.R | 14 - 3 files changed, 41 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5b22e34e/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 0cfe190..cc129a7 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -299,6 +299,7 @@ export("as.DataFrame", "createDataFrame", "createExternalTable", "dropTempTable", + "dropTempView", "jsonFile", "loadDF", "parquetFile", http://git-wip-us.apache.org/repos/asf/spark/blob/5b22e34e/R/pkg/R/SQLContext.R -- diff --git a/R/pkg/R/SQLContext.R b/R/pkg/R/SQLContext.R index 3232241..b0ccc42 100644 --- a/R/pkg/R/SQLContext.R +++ b/R/pkg/R/SQLContext.R @@ -599,13 +599,14 @@ clearCache <- function() { dispatchFunc("clearCache()") } -#' Drop Temporary Table +#' (Deprecated) Drop Temporary Table #' #' Drops the temporary table with the given table name in the catalog. #' If the table has been cached/persisted before, it's also unpersisted. #' #' @param tableName The name of the SparkSQL table to be dropped. -#' @rdname dropTempTable +#' @seealso \link{dropTempView} +#' @rdname dropTempTable-deprecated #' @export #' @examples #' \dontrun{ @@ -619,16 +620,42 @@ clearCache <- function() { #' @method dropTempTable default dropTempTable.default <- function(tableName) { - sparkSession <- getSparkSession() if (class(tableName) != "character") { stop("tableName must be a string.") } - catalog <- callJMethod(sparkSession, "catalog") - callJMethod(catalog, "dropTempView", tableName) + dropTempView(tableName) } dropTempTable <- function(x, ...) { - dispatchFunc("dropTempTable(tableName)", x, ...) + .Deprecated("dropTempView") + dispatchFunc("dropTempView(viewName)", x, ...) +} + +#' Drops the temporary view with the given view name in the catalog. +#' +#' Drops the temporary view with the given view name in the catalog. +#' If the view has been cached before, then it will also be uncached. +#' +#' @param viewName the name of the view to be dropped. 
+#' @rdname dropTempView +#' @name dropTempView +#' @export +#' @examples +#' \dontrun{ +#' sparkR.session() +#' df <- read.df(path, "parquet") +#' createOrReplaceTempView(df, "table") +#' dropTempView("table") +#' } +#' @note since 2.0.0 + +dropTempView <- function(viewName) { + sparkSession <- getSparkSession() + if (class(viewName) != "character") { +stop("viewName must be a string.") + } + catalog <- callJMethod(sparkSession, "catalog") + callJMethod(catalog, "dropTempView", viewName) } #' Load a SparkDataFrame http://git-wip-us.apache.org/repos/asf/spark/blob/5b22e34e/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index c5c5a06..ceba0d1 100644 --- a/R/pkg/inst/tests/testthat/test_sparkSQL.R +++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R @@ -472,8 +472,8 @@ test_that("test tableNames and tables", { suppressWarnings(registerTempTable(df, "table2")) tables <- tables() expect_equal(count(tables), 2) - dropTempTable("table1") - dropTempTable("table2") + suppressWarnings(dropTempTable("table1")) + dropTempView("table2") tables <- tables() expect_equal(count(tables), 0) @@ -486,7 +486,7 @@ test_that( newdf <- sql("SELECT * FROM table1 where name = 'Michael'") expect_is(newdf, "SparkDataFrame") expect_equal(count(newdf), 1) - dropTempTable("table1") + dropTempView("table1")
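For scripts being migrated, the replacement is mechanical: register a temporary view as before, then drop it with `dropTempView`; `dropTempTable` still works but now emits a deprecation warning. A small sketch, assuming an active SparkR 2.0 session and throwaway data:

```r
df <- createDataFrame(data.frame(name = c("Michael", "Andy"), age = c(NA, 30)))
createOrReplaceTempView(df, "people")

count(sql("SELECT * FROM people"))           # 2

dropTempView("people")                       # new API
# suppressWarnings(dropTempTable("people"))  # deprecated spelling, same effect
```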
spark git commit: [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable
Repository: spark Updated Branches: refs/heads/master 961342489 -> 36e812d4b [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable ## What changes were proposed in this pull request? Add dropTempView and deprecate dropTempTable ## How was this patch tested? unit tests shivaram liancheng Author: Felix Cheung Closes #13753 from felixcheung/rdroptempview. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/36e812d4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/36e812d4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/36e812d4 Branch: refs/heads/master Commit: 36e812d4b695566437c6bac991ef06a0f81fb1c5 Parents: 9613424 Author: Felix Cheung Authored: Mon Jun 20 11:24:41 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 11:24:41 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/SQLContext.R | 39 ++ R/pkg/inst/tests/testthat/test_sparkSQL.R | 14 - 3 files changed, 41 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/36e812d4/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 0cfe190..cc129a7 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -299,6 +299,7 @@ export("as.DataFrame", "createDataFrame", "createExternalTable", "dropTempTable", + "dropTempView", "jsonFile", "loadDF", "parquetFile", http://git-wip-us.apache.org/repos/asf/spark/blob/36e812d4/R/pkg/R/SQLContext.R -- diff --git a/R/pkg/R/SQLContext.R b/R/pkg/R/SQLContext.R index 3232241..b0ccc42 100644 --- a/R/pkg/R/SQLContext.R +++ b/R/pkg/R/SQLContext.R @@ -599,13 +599,14 @@ clearCache <- function() { dispatchFunc("clearCache()") } -#' Drop Temporary Table +#' (Deprecated) Drop Temporary Table #' #' Drops the temporary table with the given table name in the catalog. #' If the table has been cached/persisted before, it's also unpersisted. #' #' @param tableName The name of the SparkSQL table to be dropped. -#' @rdname dropTempTable +#' @seealso \link{dropTempView} +#' @rdname dropTempTable-deprecated #' @export #' @examples #' \dontrun{ @@ -619,16 +620,42 @@ clearCache <- function() { #' @method dropTempTable default dropTempTable.default <- function(tableName) { - sparkSession <- getSparkSession() if (class(tableName) != "character") { stop("tableName must be a string.") } - catalog <- callJMethod(sparkSession, "catalog") - callJMethod(catalog, "dropTempView", tableName) + dropTempView(tableName) } dropTempTable <- function(x, ...) { - dispatchFunc("dropTempTable(tableName)", x, ...) + .Deprecated("dropTempView") + dispatchFunc("dropTempView(viewName)", x, ...) +} + +#' Drops the temporary view with the given view name in the catalog. +#' +#' Drops the temporary view with the given view name in the catalog. +#' If the view has been cached before, then it will also be uncached. +#' +#' @param viewName the name of the view to be dropped. 
+#' @rdname dropTempView +#' @name dropTempView +#' @export +#' @examples +#' \dontrun{ +#' sparkR.session() +#' df <- read.df(path, "parquet") +#' createOrReplaceTempView(df, "table") +#' dropTempView("table") +#' } +#' @note since 2.0.0 + +dropTempView <- function(viewName) { + sparkSession <- getSparkSession() + if (class(viewName) != "character") { +stop("viewName must be a string.") + } + catalog <- callJMethod(sparkSession, "catalog") + callJMethod(catalog, "dropTempView", viewName) } #' Load a SparkDataFrame http://git-wip-us.apache.org/repos/asf/spark/blob/36e812d4/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index c5c5a06..ceba0d1 100644 --- a/R/pkg/inst/tests/testthat/test_sparkSQL.R +++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R @@ -472,8 +472,8 @@ test_that("test tableNames and tables", { suppressWarnings(registerTempTable(df, "table2")) tables <- tables() expect_equal(count(tables), 2) - dropTempTable("table1") - dropTempTable("table2") + suppressWarnings(dropTempTable("table1")) + dropTempView("table2") tables <- tables() expect_equal(count(tables), 0) @@ -486,7 +486,7 @@ test_that( newdf <- sql("SELECT * FROM table1 where name = 'Michael'") expect_is(newdf, "SparkDataFrame") expect_equal(count(newdf), 1) - dropTempTable("table1") + dropTempView("table1") }) test_that("test cache, uncache and clearCache", { @@ -495,7 +495,7 @@ test_that("test cache, uncache and clea
spark git commit: [SPARK-16059][R] Add `monotonically_increasing_id` function in SparkR
Repository: spark Updated Branches: refs/heads/branch-2.0 363db9f8b -> bb80d1c24 [SPARK-16059][R] Add `monotonically_increasing_id` function in SparkR ## What changes were proposed in this pull request? This PR adds `monotonically_increasing_id` column function in SparkR for API parity. After this PR, SparkR supports the followings. ```r > df <- read.json("examples/src/main/resources/people.json") > collect(select(df, monotonically_increasing_id(), df$name, df$age)) monotonically_increasing_id()name age 1 0 Michael NA 2 1Andy 30 3 2 Justin 19 ``` ## How was this patch tested? Pass the Jenkins tests (with added testcase). Author: Dongjoon Hyun Closes #13774 from dongjoon-hyun/SPARK-16059. (cherry picked from commit 9613424898fd2a586156bc4eb48e255749774f20) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bb80d1c2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bb80d1c2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bb80d1c2 Branch: refs/heads/branch-2.0 Commit: bb80d1c24a633ceb4ad63b1fa8c02c66d79b2540 Parents: 363db9f Author: Dongjoon Hyun Authored: Mon Jun 20 11:12:41 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 11:12:51 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 27 ++ R/pkg/R/generics.R| 5 + R/pkg/inst/tests/testthat/test_sparkSQL.R | 2 +- 4 files changed, 34 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bb80d1c2/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 82e56ca..0cfe190 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -218,6 +218,7 @@ exportMethods("%in%", "mean", "min", "minute", + "monotonically_increasing_id", "month", "months_between", "n", http://git-wip-us.apache.org/repos/asf/spark/blob/bb80d1c2/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index a779127..0fb38bc 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -911,6 +911,33 @@ setMethod("minute", column(jc) }) +#' monotonically_increasing_id +#' +#' Return a column that generates monotonically increasing 64-bit integers. +#' +#' The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. +#' The current implementation puts the partition ID in the upper 31 bits, and the record number +#' within each partition in the lower 33 bits. The assumption is that the SparkDataFrame has +#' less than 1 billion partitions, and each partition has less than 8 billion records. +#' +#' As an example, consider a SparkDataFrame with two partitions, each with 3 records. +#' This expression would return the following IDs: +#' 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594. +#' +#' This is equivalent to the MONOTONICALLY_INCREASING_ID function in SQL. +#' +#' @rdname monotonically_increasing_id +#' @name monotonically_increasing_id +#' @family misc_funcs +#' @export +#' @examples \dontrun{select(df, monotonically_increasing_id())} +setMethod("monotonically_increasing_id", + signature(x = "missing"), + function() { +jc <- callJStatic("org.apache.spark.sql.functions", "monotonically_increasing_id") +column(jc) + }) + #' month #' #' Extracts the month as an integer from a given date/timestamp/string. 
http://git-wip-us.apache.org/repos/asf/spark/blob/bb80d1c2/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 6e754af..37d0556 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -993,6 +993,11 @@ setGeneric("md5", function(x) { standardGeneric("md5") }) #' @export setGeneric("minute", function(x) { standardGeneric("minute") }) +#' @rdname monotonically_increasing_id +#' @export +setGeneric("monotonically_increasing_id", + function(x) { standardGeneric("monotonically_increasing_id") }) + #' @rdname month #' @export setGeneric("month", function(x) { standardGeneric("month") }) http://git-wip-us.apache.org/repos/asf/spark/blob/bb80d1c2/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index fcc2ab3..c5c
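The ID layout described in the roxygen above can be sanity-checked with plain arithmetic: an ID is the partition index shifted into the upper bits plus the record's position within that partition. A tiny base R illustration (no Spark required; the helper name is made up):

```r
# id = partition * 2^33 + record, per the documented bit layout
id_for <- function(partition, record) partition * 2^33 + record

id_for(0, 0:2)   # 0 1 2                                (first partition)
id_for(1, 0:2)   # 8589934592 8589934593 8589934594     (second partition)
```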
spark git commit: [SPARK-16059][R] Add `monotonically_increasing_id` function in SparkR
Repository: spark Updated Branches: refs/heads/master 5cfabec87 -> 961342489 [SPARK-16059][R] Add `monotonically_increasing_id` function in SparkR ## What changes were proposed in this pull request? This PR adds `monotonically_increasing_id` column function in SparkR for API parity. After this PR, SparkR supports the followings. ```r > df <- read.json("examples/src/main/resources/people.json") > collect(select(df, monotonically_increasing_id(), df$name, df$age)) monotonically_increasing_id()name age 1 0 Michael NA 2 1Andy 30 3 2 Justin 19 ``` ## How was this patch tested? Pass the Jenkins tests (with added testcase). Author: Dongjoon Hyun Closes #13774 from dongjoon-hyun/SPARK-16059. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/96134248 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/96134248 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/96134248 Branch: refs/heads/master Commit: 9613424898fd2a586156bc4eb48e255749774f20 Parents: 5cfabec Author: Dongjoon Hyun Authored: Mon Jun 20 11:12:41 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 11:12:41 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 27 ++ R/pkg/R/generics.R| 5 + R/pkg/inst/tests/testthat/test_sparkSQL.R | 2 +- 4 files changed, 34 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/96134248/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 82e56ca..0cfe190 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -218,6 +218,7 @@ exportMethods("%in%", "mean", "min", "minute", + "monotonically_increasing_id", "month", "months_between", "n", http://git-wip-us.apache.org/repos/asf/spark/blob/96134248/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index a779127..0fb38bc 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -911,6 +911,33 @@ setMethod("minute", column(jc) }) +#' monotonically_increasing_id +#' +#' Return a column that generates monotonically increasing 64-bit integers. +#' +#' The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. +#' The current implementation puts the partition ID in the upper 31 bits, and the record number +#' within each partition in the lower 33 bits. The assumption is that the SparkDataFrame has +#' less than 1 billion partitions, and each partition has less than 8 billion records. +#' +#' As an example, consider a SparkDataFrame with two partitions, each with 3 records. +#' This expression would return the following IDs: +#' 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594. +#' +#' This is equivalent to the MONOTONICALLY_INCREASING_ID function in SQL. +#' +#' @rdname monotonically_increasing_id +#' @name monotonically_increasing_id +#' @family misc_funcs +#' @export +#' @examples \dontrun{select(df, monotonically_increasing_id())} +setMethod("monotonically_increasing_id", + signature(x = "missing"), + function() { +jc <- callJStatic("org.apache.spark.sql.functions", "monotonically_increasing_id") +column(jc) + }) + #' month #' #' Extracts the month as an integer from a given date/timestamp/string. 
http://git-wip-us.apache.org/repos/asf/spark/blob/96134248/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 6e754af..37d0556 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -993,6 +993,11 @@ setGeneric("md5", function(x) { standardGeneric("md5") }) #' @export setGeneric("minute", function(x) { standardGeneric("minute") }) +#' @rdname monotonically_increasing_id +#' @export +setGeneric("monotonically_increasing_id", + function(x) { standardGeneric("monotonically_increasing_id") }) + #' @rdname month #' @export setGeneric("month", function(x) { standardGeneric("month") }) http://git-wip-us.apache.org/repos/asf/spark/blob/96134248/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index fcc2ab3..c5c5a06 100644 --- a/R/pkg/inst/tests/testthat/test_sparkSQL.R +++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R @@ -1047
spark git commit: [SPARK-16050][TESTS] Remove the flaky test: ConsoleSinkSuite
Repository: spark Updated Branches: refs/heads/master 905f774b7 -> 5cfabec87 [SPARK-16050][TESTS] Remove the flaky test: ConsoleSinkSuite ## What changes were proposed in this pull request? ConsoleSinkSuite just collects content from stdout and compare them with the expected string. However, because Spark may not stop some background threads at once, there is a race condition that other threads are outputting logs to **stdout** while ConsoleSinkSuite is running. Then it will make ConsoleSinkSuite fail. Therefore, I just deleted `ConsoleSinkSuite`. If we want to test ConsoleSinkSuite in future, we should refactoring ConsoleSink to make it testable instead of depending on stdout. Therefore, this test is useless and I just delete it. ## How was this patch tested? Just removed a flaky test. Author: Shixiong Zhu Closes #13776 from zsxwing/SPARK-16050. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5cfabec8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5cfabec8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5cfabec8 Branch: refs/heads/master Commit: 5cfabec8724714b897d6e23e670c39e58f554ea2 Parents: 905f774 Author: Shixiong Zhu Authored: Mon Jun 20 10:35:37 2016 -0700 Committer: Michael Armbrust Committed: Mon Jun 20 10:35:37 2016 -0700 -- .../execution/streaming/ConsoleSinkSuite.scala | 99 1 file changed, 99 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5cfabec8/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala deleted file mode 100644 index e853d8c..000 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala +++ /dev/null @@ -1,99 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.spark.sql.execution.streaming - -import java.io.{ByteArrayOutputStream, PrintStream} -import java.nio.charset.StandardCharsets.UTF_8 - -import org.scalatest.BeforeAndAfter - -import org.apache.spark.sql.streaming.StreamTest - -class ConsoleSinkSuite extends StreamTest with BeforeAndAfter { - - import testImplicits._ - - after { -sqlContext.streams.active.foreach(_.stop()) - } - - test("SPARK-16020 Complete mode aggregation with console sink") { -withTempDir { checkpointLocation => - val origOut = System.out - val stdout = new ByteArrayOutputStream() - try { -// Hook Java System.out.println -System.setOut(new PrintStream(stdout)) -// Hook Scala println -Console.withOut(stdout) { - val input = MemoryStream[String] - val df = input.toDF().groupBy("value").count() - val query = df.writeStream -.format("console") -.outputMode("complete") -.option("checkpointLocation", checkpointLocation.getAbsolutePath) -.start() - input.addData("a") - query.processAllAvailable() - input.addData("a", "b") - query.processAllAvailable() - input.addData("a", "b", "c") - query.processAllAvailable() - query.stop() -} -System.out.flush() - } finally { -System.setOut(origOut) - } - - val expected = """--- -|Batch: 0 -|--- -|+-+-+ -||value|count| -|+-+-+ -||a|1| -|+-+-+ -| -|--- -|Batch: 1 -|--- -|+-+-+ -||value|count| -|+-+-+ -||a|2| -||b|1| -|+-+-+ -
spark git commit: [SPARK-16050][TESTS] Remove the flaky test: ConsoleSinkSuite
Repository: spark Updated Branches: refs/heads/branch-2.0 0b0b5fe54 -> 363db9f8b [SPARK-16050][TESTS] Remove the flaky test: ConsoleSinkSuite ## What changes were proposed in this pull request? ConsoleSinkSuite just collects content from stdout and compare them with the expected string. However, because Spark may not stop some background threads at once, there is a race condition that other threads are outputting logs to **stdout** while ConsoleSinkSuite is running. Then it will make ConsoleSinkSuite fail. Therefore, I just deleted `ConsoleSinkSuite`. If we want to test ConsoleSinkSuite in future, we should refactoring ConsoleSink to make it testable instead of depending on stdout. Therefore, this test is useless and I just delete it. ## How was this patch tested? Just removed a flaky test. Author: Shixiong Zhu Closes #13776 from zsxwing/SPARK-16050. (cherry picked from commit 5cfabec8724714b897d6e23e670c39e58f554ea2) Signed-off-by: Michael Armbrust Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/363db9f8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/363db9f8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/363db9f8 Branch: refs/heads/branch-2.0 Commit: 363db9f8be53773238854ab16c3459ba46a6961b Parents: 0b0b5fe Author: Shixiong Zhu Authored: Mon Jun 20 10:35:37 2016 -0700 Committer: Michael Armbrust Committed: Mon Jun 20 10:35:49 2016 -0700 -- .../execution/streaming/ConsoleSinkSuite.scala | 99 1 file changed, 99 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/363db9f8/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala deleted file mode 100644 index e853d8c..000 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala +++ /dev/null @@ -1,99 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.spark.sql.execution.streaming - -import java.io.{ByteArrayOutputStream, PrintStream} -import java.nio.charset.StandardCharsets.UTF_8 - -import org.scalatest.BeforeAndAfter - -import org.apache.spark.sql.streaming.StreamTest - -class ConsoleSinkSuite extends StreamTest with BeforeAndAfter { - - import testImplicits._ - - after { -sqlContext.streams.active.foreach(_.stop()) - } - - test("SPARK-16020 Complete mode aggregation with console sink") { -withTempDir { checkpointLocation => - val origOut = System.out - val stdout = new ByteArrayOutputStream() - try { -// Hook Java System.out.println -System.setOut(new PrintStream(stdout)) -// Hook Scala println -Console.withOut(stdout) { - val input = MemoryStream[String] - val df = input.toDF().groupBy("value").count() - val query = df.writeStream -.format("console") -.outputMode("complete") -.option("checkpointLocation", checkpointLocation.getAbsolutePath) -.start() - input.addData("a") - query.processAllAvailable() - input.addData("a", "b") - query.processAllAvailable() - input.addData("a", "b", "c") - query.processAllAvailable() - query.stop() -} -System.out.flush() - } finally { -System.setOut(origOut) - } - - val expected = """--- -|Batch: 0 -|--- -|+-+-+ -||value|count| -|+-+-+ -||a|1| -|+-+-+ -| -|--- -|Batch: 1 -|--- -|+-+-+ -||value|c
spark git commit: [SPARK-14391][LAUNCHER] Fix launcher communication test, take 2.
Repository: spark Updated Branches: refs/heads/branch-1.6 208348595 -> 16b7f1dfc [SPARK-14391][LAUNCHER] Fix launcher communication test, take 2. There's actually a race here: the state of the handler was changed before the connection was set, so the test code could be notified of the state change, wake up, and still see the connection as null, triggering the assert. Author: Marcelo Vanzin Closes #12785 from vanzin/SPARK-14391. (cherry picked from commit 73c20bf32524c2232febc8c4b12d5fa228347163) Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/16b7f1df Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/16b7f1df Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/16b7f1df Branch: refs/heads/branch-1.6 Commit: 16b7f1dfc0570f32e23f640e063d8e7fd9115792 Parents: 2083485 Author: Marcelo Vanzin Authored: Fri Apr 29 23:13:50 2016 -0700 Committer: Marcelo Vanzin Committed: Mon Jun 20 09:55:06 2016 -0700 -- .../src/main/java/org/apache/spark/launcher/LauncherServer.java| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/16b7f1df/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java -- diff --git a/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java b/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java index 414ffc2..e493514 100644 --- a/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java +++ b/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java @@ -298,8 +298,8 @@ class LauncherServer implements Closeable { Hello hello = (Hello) msg; ChildProcAppHandle handle = pending.remove(hello.secret); if (handle != null) { -handle.setState(SparkAppHandle.State.CONNECTED); handle.setConnection(this); +handle.setState(SparkAppHandle.State.CONNECTED); this.handle = handle; } else { throw new IllegalArgumentException("Received Hello for unknown client.");
spark git commit: [SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables
Repository: spark Updated Branches: refs/heads/branch-2.0 19397caab -> 0b0b5fe54 [SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables ## What changes were proposed in this pull request? This PR adds the static partition support to INSERT statement when the target table is a data source table. ## How was this patch tested? New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite. **Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.** Author: Yin Huai Closes #13769 from yhuai/SPARK-16030-1. (cherry picked from commit 905f774b71f4b814d5a2412c7c35bd023c3dfdf8) Signed-off-by: Cheng Lian Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0b0b5fe5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0b0b5fe5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0b0b5fe5 Branch: refs/heads/branch-2.0 Commit: 0b0b5fe549086171d851d7c4458d48be9409380f Parents: 19397ca Author: Yin Huai Authored: Mon Jun 20 20:17:47 2016 +0800 Committer: Cheng Lian Committed: Mon Jun 20 20:18:17 2016 +0800 -- .../sql/catalyst/analysis/CheckAnalysis.scala | 19 ++ .../datasources/DataSourceStrategy.scala| 127 +++- .../spark/sql/execution/datasources/rules.scala | 7 - .../spark/sql/internal/SessionState.scala | 2 +- .../sql/sources/DataSourceAnalysisSuite.scala | 202 +++ .../spark/sql/hive/HiveSessionState.scala | 2 +- .../hive/execution/InsertIntoHiveTable.scala| 3 +- .../sql/hive/InsertIntoHiveTableSuite.scala | 97 - .../sql/hive/execution/HiveQuerySuite.scala | 2 +- 9 files changed, 436 insertions(+), 25 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0b0b5fe5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 7b451ba..8992276 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -313,6 +313,8 @@ trait CheckAnalysis extends PredicateHelper { |${s.catalogTable.identifier} """.stripMargin) + // TODO: We need to consolidate this kind of checks for InsertIntoTable + // with the rule of PreWriteCheck defined in extendedCheckRules. 
case InsertIntoTable(s: SimpleCatalogRelation, _, _, _, _) => failAnalysis( s""" @@ -320,6 +322,23 @@ trait CheckAnalysis extends PredicateHelper { |${s.catalogTable.identifier} """.stripMargin) + case InsertIntoTable(t, _, _, _, _) +if !t.isInstanceOf[LeafNode] || + t == OneRowRelation || + t.isInstanceOf[LocalRelation] => +failAnalysis(s"Inserting into an RDD-based table is not allowed.") + + case i @ InsertIntoTable(table, partitions, query, _, _) => +val numStaticPartitions = partitions.values.count(_.isDefined) +if (table.output.size != (query.output.size + numStaticPartitions)) { + failAnalysis( +s"$table requires that the data to be inserted have the same number of " + + s"columns as the target table: target table has ${table.output.size} " + + s"column(s) but the inserted data has " + + s"${query.output.size + numStaticPartitions} column(s), including " + + s"$numStaticPartitions partition column(s) having constant value(s).") +} + case o if !o.resolved => failAnalysis( s"unresolved operator ${operator.simpleString}") http://git-wip-us.apache.org/repos/asf/spark/blob/0b0b5fe5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index 2b47865..27133f0 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala @@ -22,15 +22,15 @@ import scala.collection.mutable.ArrayBuffer import org.apache.spark.internal.Logging
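In SQL terms, the change lets an INSERT against a partitioned data source table pin some partition columns to constants in a PARTITION clause, with the remaining columns supplied by the query. An illustrative pair of statements issued from SparkR (table and column names are invented; `events` is assumed to be a data source table partitioned by `ds`):

```r
sparkR.session()

# Static partition: `ds` is fixed to a constant, so the SELECT supplies only the data columns.
sql("INSERT INTO TABLE events PARTITION (ds = '2016-06-20') SELECT id, name FROM staging")

# Fully dynamic alternative: the partition value comes from the query's trailing column.
sql("INSERT INTO TABLE events SELECT id, name, ds FROM staging")
```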
spark git commit: [SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables
Repository: spark Updated Branches: refs/heads/master 6d0f921ae -> 905f774b7 [SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables ## What changes were proposed in this pull request? This PR adds the static partition support to INSERT statement when the target table is a data source table. ## How was this patch tested? New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite. **Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.** Author: Yin Huai Closes #13769 from yhuai/SPARK-16030-1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/905f774b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/905f774b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/905f774b Branch: refs/heads/master Commit: 905f774b71f4b814d5a2412c7c35bd023c3dfdf8 Parents: 6d0f921 Author: Yin Huai Authored: Mon Jun 20 20:17:47 2016 +0800 Committer: Cheng Lian Committed: Mon Jun 20 20:17:47 2016 +0800 -- .../sql/catalyst/analysis/CheckAnalysis.scala | 19 ++ .../datasources/DataSourceStrategy.scala| 127 +++- .../spark/sql/execution/datasources/rules.scala | 7 - .../spark/sql/internal/SessionState.scala | 2 +- .../sql/sources/DataSourceAnalysisSuite.scala | 202 +++ .../spark/sql/hive/HiveSessionState.scala | 2 +- .../hive/execution/InsertIntoHiveTable.scala| 3 +- .../sql/hive/InsertIntoHiveTableSuite.scala | 97 - .../sql/hive/execution/HiveQuerySuite.scala | 2 +- 9 files changed, 436 insertions(+), 25 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/905f774b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 7b451ba..8992276 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -313,6 +313,8 @@ trait CheckAnalysis extends PredicateHelper { |${s.catalogTable.identifier} """.stripMargin) + // TODO: We need to consolidate this kind of checks for InsertIntoTable + // with the rule of PreWriteCheck defined in extendedCheckRules. 
case InsertIntoTable(s: SimpleCatalogRelation, _, _, _, _) => failAnalysis( s""" @@ -320,6 +322,23 @@ trait CheckAnalysis extends PredicateHelper { |${s.catalogTable.identifier} """.stripMargin) + case InsertIntoTable(t, _, _, _, _) +if !t.isInstanceOf[LeafNode] || + t == OneRowRelation || + t.isInstanceOf[LocalRelation] => +failAnalysis(s"Inserting into an RDD-based table is not allowed.") + + case i @ InsertIntoTable(table, partitions, query, _, _) => +val numStaticPartitions = partitions.values.count(_.isDefined) +if (table.output.size != (query.output.size + numStaticPartitions)) { + failAnalysis( +s"$table requires that the data to be inserted have the same number of " + + s"columns as the target table: target table has ${table.output.size} " + + s"column(s) but the inserted data has " + + s"${query.output.size + numStaticPartitions} column(s), including " + + s"$numStaticPartitions partition column(s) having constant value(s).") +} + case o if !o.resolved => failAnalysis( s"unresolved operator ${operator.simpleString}") http://git-wip-us.apache.org/repos/asf/spark/blob/905f774b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index 2b47865..27133f0 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala @@ -22,15 +22,15 @@ import scala.collection.mutable.ArrayBuffer import org.apache.spark.internal.Logging import org.apache.spark.rdd.RDD import org.apache.spark.sql._ -import org.apache.spark.sql.catalyst.{Ca