spark git commit: [SPARKR][DOCS] R code doc cleanup
Repository: spark Updated Branches: refs/heads/branch-2.0 4e193d3da -> 38f3b76bd [SPARKR][DOCS] R code doc cleanup ## What changes were proposed in this pull request? I ran a full pass from A to Z and fixed the obvious duplications, improper grouping etc. There are still more doc issues to be cleaned up. ## How was this patch tested? manual tests Author: Felix Cheung Closes #13798 from felixcheung/rdocseealso. (cherry picked from commit 09f4ceaeb0a99874f774e09d868fdf907ecf256f) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/38f3b76b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/38f3b76b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/38f3b76b Branch: refs/heads/branch-2.0 Commit: 38f3b76bd6b4a3e4d20048beeb92275ebf93c8d8 Parents: 4e193d3 Author: Felix Cheung Authored: Mon Jun 20 23:51:08 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 23:51:20 2016 -0700 -- R/pkg/R/DataFrame.R | 39 ++- R/pkg/R/SQLContext.R | 6 +++--- R/pkg/R/column.R | 6 ++ R/pkg/R/context.R| 5 +++-- R/pkg/R/functions.R | 40 +--- R/pkg/R/generics.R | 44 ++-- R/pkg/R/mllib.R | 6 -- R/pkg/R/sparkR.R | 8 +--- 8 files changed, 70 insertions(+), 84 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/38f3b76b/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index b3f2dd8..a8ade1a 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -463,6 +463,7 @@ setMethod("createOrReplaceTempView", }) #' (Deprecated) Register Temporary Table +#' #' Registers a SparkDataFrame as a Temporary Table in the SQLContext #' @param x A SparkDataFrame #' @param tableName A character vector containing the name of the table @@ -606,10 +607,10 @@ setMethod("unpersist", #' #' The following options for repartition are possible: #' \itemize{ -#' \item{"Option 1"} {Return a new SparkDataFrame partitioned by +#' \item{1.} {Return a new SparkDataFrame partitioned by #' the given columns into `numPartitions`.} -#' \item{"Option 2"} {Return a new SparkDataFrame that has exactly `numPartitions`.} -#' \item{"Option 3"} {Return a new SparkDataFrame partitioned by the given column(s), +#' \item{2.} {Return a new SparkDataFrame that has exactly `numPartitions`.} +#' \item{3.} {Return a new SparkDataFrame partitioned by the given column(s), #' using `spark.sql.shuffle.partitions` as number of partitions.} #'} #' @param x A SparkDataFrame @@ -1053,7 +1054,7 @@ setMethod("limit", dataFrame(res) }) -#' Take the first NUM rows of a SparkDataFrame and return a the results as a data.frame +#' Take the first NUM rows of a SparkDataFrame and return a the results as a R data.frame #' #' @family SparkDataFrame functions #' @rdname take @@ -1076,7 +1077,7 @@ setMethod("take", #' Head #' -#' Return the first NUM rows of a SparkDataFrame as a data.frame. If NUM is NULL, +#' Return the first NUM rows of a SparkDataFrame as a R data.frame. If NUM is NULL, #' then head() returns the first 6 rows in keeping with the current data.frame #' convention in R. #' @@ -1157,7 +1158,6 @@ setMethod("toRDD", #' #' @param x a SparkDataFrame #' @return a GroupedData -#' @seealso GroupedData #' @family SparkDataFrame functions #' @rdname groupBy #' @name groupBy @@ -1242,9 +1242,9 @@ dapplyInternal <- function(x, func, schema) { #' #' @param x A SparkDataFrame #' @param func A function to be applied to each partition of the SparkDataFrame. 
-#' func should have only one parameter, to which a data.frame corresponds +#' func should have only one parameter, to which a R data.frame corresponds #' to each partition will be passed. -#' The output of func should be a data.frame. +#' The output of func should be a R data.frame. #' @param schema The schema of the resulting SparkDataFrame after the function is applied. #' It must match the output of func. #' @family SparkDataFrame functions @@ -1291,9 +1291,9 @@ setMethod("dapply", #' #' @param x A SparkDataFrame #' @param func A function to be applied to each partition of the SparkDataFrame. -#' func should have only one parameter, to which a data.frame corresponds +#' func should have only one parameter, to which a R data.frame corresponds #' to each partition will be passed. -#' The output of func should be a data.frame. +#' The output
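The roxygen text above enumerates three distinct behaviours for `repartition` on a SparkDataFrame. For comparison, a minimal Scala sketch of the analogous `Dataset.repartition` overloads (the app name and column are placeholders; the SparkR generic documented in the diff is what the commit actually touches):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-sketch").getOrCreate()
val df = spark.range(0, 1000).toDF("id")

// 1. Partition by the given column(s) into an explicit number of partitions.
val byColumnAndCount = df.repartition(8, df("id"))

// 2. Produce exactly `numPartitions` partitions.
val byCount = df.repartition(8)

// 3. Partition by the given column(s), using spark.sql.shuffle.partitions
//    as the number of partitions.
val byColumn = df.repartition(df("id"))
```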
spark git commit: [SPARKR][DOCS] R code doc cleanup
Repository: spark Updated Branches: refs/heads/master 41e0ffb19 -> 09f4ceaeb [SPARKR][DOCS] R code doc cleanup ## What changes were proposed in this pull request? I ran a full pass from A to Z and fixed the obvious duplications, improper grouping etc. There are still more doc issues to be cleaned up. ## How was this patch tested? manual tests Author: Felix Cheung Closes #13798 from felixcheung/rdocseealso. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/09f4ceae Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/09f4ceae Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/09f4ceae Branch: refs/heads/master Commit: 09f4ceaeb0a99874f774e09d868fdf907ecf256f Parents: 41e0ffb Author: Felix Cheung Authored: Mon Jun 20 23:51:08 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 23:51:08 2016 -0700 -- R/pkg/R/DataFrame.R | 39 ++- R/pkg/R/SQLContext.R | 6 +++--- R/pkg/R/column.R | 6 ++ R/pkg/R/context.R| 5 +++-- R/pkg/R/functions.R | 40 +--- R/pkg/R/generics.R | 44 ++-- R/pkg/R/mllib.R | 6 -- R/pkg/R/sparkR.R | 8 +--- 8 files changed, 70 insertions(+), 84 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/09f4ceae/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index b3f2dd8..a8ade1a 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -463,6 +463,7 @@ setMethod("createOrReplaceTempView", }) #' (Deprecated) Register Temporary Table +#' #' Registers a SparkDataFrame as a Temporary Table in the SQLContext #' @param x A SparkDataFrame #' @param tableName A character vector containing the name of the table @@ -606,10 +607,10 @@ setMethod("unpersist", #' #' The following options for repartition are possible: #' \itemize{ -#' \item{"Option 1"} {Return a new SparkDataFrame partitioned by +#' \item{1.} {Return a new SparkDataFrame partitioned by #' the given columns into `numPartitions`.} -#' \item{"Option 2"} {Return a new SparkDataFrame that has exactly `numPartitions`.} -#' \item{"Option 3"} {Return a new SparkDataFrame partitioned by the given column(s), +#' \item{2.} {Return a new SparkDataFrame that has exactly `numPartitions`.} +#' \item{3.} {Return a new SparkDataFrame partitioned by the given column(s), #' using `spark.sql.shuffle.partitions` as number of partitions.} #'} #' @param x A SparkDataFrame @@ -1053,7 +1054,7 @@ setMethod("limit", dataFrame(res) }) -#' Take the first NUM rows of a SparkDataFrame and return a the results as a data.frame +#' Take the first NUM rows of a SparkDataFrame and return a the results as a R data.frame #' #' @family SparkDataFrame functions #' @rdname take @@ -1076,7 +1077,7 @@ setMethod("take", #' Head #' -#' Return the first NUM rows of a SparkDataFrame as a data.frame. If NUM is NULL, +#' Return the first NUM rows of a SparkDataFrame as a R data.frame. If NUM is NULL, #' then head() returns the first 6 rows in keeping with the current data.frame #' convention in R. #' @@ -1157,7 +1158,6 @@ setMethod("toRDD", #' #' @param x a SparkDataFrame #' @return a GroupedData -#' @seealso GroupedData #' @family SparkDataFrame functions #' @rdname groupBy #' @name groupBy @@ -1242,9 +1242,9 @@ dapplyInternal <- function(x, func, schema) { #' #' @param x A SparkDataFrame #' @param func A function to be applied to each partition of the SparkDataFrame. 
-#' func should have only one parameter, to which a data.frame corresponds +#' func should have only one parameter, to which a R data.frame corresponds #' to each partition will be passed. -#' The output of func should be a data.frame. +#' The output of func should be a R data.frame. #' @param schema The schema of the resulting SparkDataFrame after the function is applied. #' It must match the output of func. #' @family SparkDataFrame functions @@ -1291,9 +1291,9 @@ setMethod("dapply", #' #' @param x A SparkDataFrame #' @param func A function to be applied to each partition of the SparkDataFrame. -#' func should have only one parameter, to which a data.frame corresponds +#' func should have only one parameter, to which a R data.frame corresponds #' to each partition will be passed. -#' The output of func should be a data.frame. +#' The output of func should be a R data.frame. #' @family SparkDataFrame functions #' @rdname dapplyCollect #' @name dapplyCo
spark git commit: [SPARK-15894][SQL][DOC] Update docs for controlling #partitions
Repository: spark Updated Branches: refs/heads/master 58f6e27dd -> 41e0ffb19 [SPARK-15894][SQL][DOC] Update docs for controlling #partitions ## What changes were proposed in this pull request? Update docs for two parameters `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes ` in Other Configuration Options. ## How was this patch tested? N/A Author: Takeshi YAMAMURO Closes #13797 from maropu/SPARK-15894-2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/41e0ffb1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/41e0ffb1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/41e0ffb1 Branch: refs/heads/master Commit: 41e0ffb19f678e9b1e87f747a5e4e3d44964e39a Parents: 58f6e27 Author: Takeshi YAMAMURO Authored: Tue Jun 21 14:27:16 2016 +0800 Committer: Cheng Lian Committed: Tue Jun 21 14:27:16 2016 +0800 -- docs/sql-programming-guide.md | 17 + 1 file changed, 17 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/41e0ffb1/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index 4206f73..ddf8f70 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -2016,6 +2016,23 @@ that these options will be deprecated in future release as more optimizations ar Property NameDefaultMeaning +spark.sql.files.maxPartitionBytes +134217728 (128 MB) + + The maximum number of bytes to pack into a single partition when reading files. + + + +spark.sql.files.openCostInBytes +4194304 (4 MB) + + The estimated cost to open a file, measured by the number of bytes could be scanned in the same + time. This is used when putting multiple files into a partition. It is better to over estimated, + then the partitions with small files will be faster than partitions with bigger files (which is + scheduled first). + + + spark.sql.autoBroadcastJoinThreshold 10485760 (10 MB) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
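Both options documented above are plain SQL configuration keys, so they can be set at runtime as well as in spark-defaults.conf. A minimal sketch in Scala, assuming illustrative values (the defaults remain 128 MB and 4 MB):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-partition-tuning").getOrCreate()

// Cap the number of bytes packed into a single partition when reading files.
spark.conf.set("spark.sql.files.maxPartitionBytes", (64 * 1024 * 1024).toString)

// Estimated cost, in bytes, of opening a file; a larger value packs more small
// files into each partition.
spark.conf.set("spark.sql.files.openCostInBytes", (8 * 1024 * 1024).toString)
```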
spark git commit: [SPARK-15894][SQL][DOC] Update docs for controlling #partitions
Repository: spark Updated Branches: refs/heads/branch-2.0 dbf7f48b6 -> 4e193d3da [SPARK-15894][SQL][DOC] Update docs for controlling #partitions ## What changes were proposed in this pull request? Update docs for two parameters `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes ` in Other Configuration Options. ## How was this patch tested? N/A Author: Takeshi YAMAMURO Closes #13797 from maropu/SPARK-15894-2. (cherry picked from commit 41e0ffb19f678e9b1e87f747a5e4e3d44964e39a) Signed-off-by: Cheng Lian Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4e193d3d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4e193d3d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4e193d3d Branch: refs/heads/branch-2.0 Commit: 4e193d3daf5bdfb38d7df6da5b7abdd53888ec99 Parents: dbf7f48 Author: Takeshi YAMAMURO Authored: Tue Jun 21 14:27:16 2016 +0800 Committer: Cheng Lian Committed: Tue Jun 21 14:27:31 2016 +0800 -- docs/sql-programming-guide.md | 17 + 1 file changed, 17 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4e193d3d/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index 4206f73..ddf8f70 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -2016,6 +2016,23 @@ that these options will be deprecated in future release as more optimizations ar Property NameDefaultMeaning +spark.sql.files.maxPartitionBytes +134217728 (128 MB) + + The maximum number of bytes to pack into a single partition when reading files. + + + +spark.sql.files.openCostInBytes +4194304 (4 MB) + + The estimated cost to open a file, measured by the number of bytes could be scanned in the same + time. This is used when putting multiple files into a partition. It is better to over estimated, + then the partitions with small files will be faster than partitions with bigger files (which is + scheduled first). + + + spark.sql.autoBroadcastJoinThreshold 10485760 (10 MB) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R
Repository: spark Updated Branches: refs/heads/branch-2.0 4fc4eb943 -> dbf7f48b6 [SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R ## What changes were proposed in this pull request? Update doc as per discussion in PR #13592 ## How was this patch tested? manual shivaram liancheng Author: Felix Cheung Closes #13799 from felixcheung/rsqlprogrammingguide. (cherry picked from commit 58f6e27dd70f476f99ac8204e6b405bced4d6de1) Signed-off-by: Cheng Lian Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dbf7f48b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dbf7f48b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dbf7f48b Branch: refs/heads/branch-2.0 Commit: dbf7f48b6e73f3500b0abe9055ac204a3f756418 Parents: 4fc4eb9 Author: Felix Cheung Authored: Tue Jun 21 13:56:37 2016 +0800 Committer: Cheng Lian Committed: Tue Jun 21 13:57:03 2016 +0800 -- docs/sparkr.md| 2 +- docs/sql-programming-guide.md | 34 -- 2 files changed, 17 insertions(+), 19 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dbf7f48b/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 023bbcd..f018901 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -152,7 +152,7 @@ write.df(people, path="people.parquet", source="parquet", mode="overwrite") ### From Hive tables -You can also create SparkDataFrames from Hive tables. To do this we will need to create a SparkSession with Hive support which can access tables in the Hive MetaStore. Note that Spark should have been built with [Hive support](building-spark.html#building-with-hive-and-jdbc-support) and more details can be found in the [SQL programming guide](sql-programming-guide.html#starting-point-sqlcontext). In SparkR, by default it will attempt to create a SparkSession with Hive support enabled (`enableHiveSupport = TRUE`). +You can also create SparkDataFrames from Hive tables. To do this we will need to create a SparkSession with Hive support which can access tables in the Hive MetaStore. Note that Spark should have been built with [Hive support](building-spark.html#building-with-hive-and-jdbc-support) and more details can be found in the [SQL programming guide](sql-programming-guide.html#starting-point-sparksession). In SparkR, by default it will attempt to create a SparkSession with Hive support enabled (`enableHiveSupport = TRUE`). {% highlight r %} http://git-wip-us.apache.org/repos/asf/spark/blob/dbf7f48b/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index d93f30b..4206f73 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -107,19 +107,17 @@ spark = SparkSession.build \ -Unlike Scala, Java, and Python API, we haven't finished migrating `SQLContext` to `SparkSession` for SparkR yet, so -the entry point into all relational functionality in SparkR is still the -`SQLContext` class in Spark 2.0. To create a basic `SQLContext`, all you need is a `SparkContext`. +The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`: {% highlight r %} -spark <- sparkRSQL.init(sc) +sparkR.session() {% endhighlight %} -Note that when invoked for the first time, `sparkRSQL.init()` initializes a global `SQLContext` singleton instance, and always returns a reference to this instance for successive invocations. 
In this way, users only need to initialize the `SQLContext` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SQLContext` instance around. +Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around. -`SparkSession` (or `SQLContext` for SparkR) in Spark 2.0 provides builtin support for Hive features including the ability to +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. To use these features, you do not need to have an existing Hive setup. @@ -175,7 +173,7 @@ df.show()
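For readers comparing with the other language tabs in the same guide, the Scala entry point that the updated R section now parallels looks roughly like this (a sketch only; the app name and JSON path are placeholders, and `enableHiveSupport()` is optional):

```scala
import org.apache.spark.sql.SparkSession

// Scala counterpart of sparkR.session(): a single SparkSession is the entry point,
// optionally with Hive support enabled.
val spark = SparkSession.builder()
  .appName("spark-session-sketch")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.read.json("examples/src/main/resources/people.json")
df.show()
```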
spark git commit: [SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R
Repository: spark Updated Branches: refs/heads/master 07367533d -> 58f6e27dd [SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R ## What changes were proposed in this pull request? Update doc as per discussion in PR #13592 ## How was this patch tested? manual shivaram liancheng Author: Felix Cheung Closes #13799 from felixcheung/rsqlprogrammingguide. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/58f6e27d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/58f6e27d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/58f6e27d Branch: refs/heads/master Commit: 58f6e27dd70f476f99ac8204e6b405bced4d6de1 Parents: 0736753 Author: Felix Cheung Authored: Tue Jun 21 13:56:37 2016 +0800 Committer: Cheng Lian Committed: Tue Jun 21 13:56:37 2016 +0800 -- docs/sparkr.md| 2 +- docs/sql-programming-guide.md | 34 -- 2 files changed, 17 insertions(+), 19 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/58f6e27d/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 023bbcd..f018901 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -152,7 +152,7 @@ write.df(people, path="people.parquet", source="parquet", mode="overwrite") ### From Hive tables -You can also create SparkDataFrames from Hive tables. To do this we will need to create a SparkSession with Hive support which can access tables in the Hive MetaStore. Note that Spark should have been built with [Hive support](building-spark.html#building-with-hive-and-jdbc-support) and more details can be found in the [SQL programming guide](sql-programming-guide.html#starting-point-sqlcontext). In SparkR, by default it will attempt to create a SparkSession with Hive support enabled (`enableHiveSupport = TRUE`). +You can also create SparkDataFrames from Hive tables. To do this we will need to create a SparkSession with Hive support which can access tables in the Hive MetaStore. Note that Spark should have been built with [Hive support](building-spark.html#building-with-hive-and-jdbc-support) and more details can be found in the [SQL programming guide](sql-programming-guide.html#starting-point-sparksession). In SparkR, by default it will attempt to create a SparkSession with Hive support enabled (`enableHiveSupport = TRUE`). {% highlight r %} http://git-wip-us.apache.org/repos/asf/spark/blob/58f6e27d/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index d93f30b..4206f73 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -107,19 +107,17 @@ spark = SparkSession.build \ -Unlike Scala, Java, and Python API, we haven't finished migrating `SQLContext` to `SparkSession` for SparkR yet, so -the entry point into all relational functionality in SparkR is still the -`SQLContext` class in Spark 2.0. To create a basic `SQLContext`, all you need is a `SparkContext`. +The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`: {% highlight r %} -spark <- sparkRSQL.init(sc) +sparkR.session() {% endhighlight %} -Note that when invoked for the first time, `sparkRSQL.init()` initializes a global `SQLContext` singleton instance, and always returns a reference to this instance for successive invocations. 
In this way, users only need to initialize the `SQLContext` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SQLContext` instance around. +Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around. -`SparkSession` (or `SQLContext` for SparkR) in Spark 2.0 provides builtin support for Hive features including the ability to +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. To use these features, you do not need to have an existing Hive setup. @@ -175,7 +173,7 @@ df.show() -With a `SQLContext`, applications can create DataFrames from an [existing `RDD`](#interoperating-wit
spark git commit: [SPARK-16025][CORE] Document OFF_HEAP storage level in 2.0
Repository: spark Updated Branches: refs/heads/branch-2.0 12f00b6ed -> 4fc4eb943 [SPARK-16025][CORE] Document OFF_HEAP storage level in 2.0 This has changed from 1.6, and now stores memory off-heap using spark's off-heap support instead of in tachyon. Author: Eric Liang Closes #13744 from ericl/spark-16025. (cherry picked from commit 07367533de68817e1e6cf9cf2b056a04dd160c8a) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4fc4eb94 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4fc4eb94 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4fc4eb94 Branch: refs/heads/branch-2.0 Commit: 4fc4eb9434676d6c7be1b0dd8ff1dc67d7d2b308 Parents: 12f00b6 Author: Eric Liang Authored: Mon Jun 20 21:56:44 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:56:49 2016 -0700 -- docs/programming-guide.md | 5 + 1 file changed, 5 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4fc4eb94/docs/programming-guide.md -- diff --git a/docs/programming-guide.md b/docs/programming-guide.md index 97bcb51..3872aec 100644 --- a/docs/programming-guide.md +++ b/docs/programming-guide.md @@ -1220,6 +1220,11 @@ storage levels is: MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. Same as the levels above, but replicate each partition on two cluster nodes. + + OFF_HEAP (experimental) + Similar to MEMORY_ONLY_SER, but store the data in +off-heap memory. This requires off-heap memory to be enabled. + **Note:** *In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
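A hedged sketch of using the newly documented level from Scala; the off-heap size and data are illustrative, but the level only takes effect when off-heap memory is enabled:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// OFF_HEAP requires off-heap memory to be enabled and sized.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "1g")

val spark = SparkSession.builder().config(conf).appName("off-heap-sketch").getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 1000000)
rdd.persist(StorageLevel.OFF_HEAP)
rdd.count()
```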
spark git commit: [SPARK-16025][CORE] Document OFF_HEAP storage level in 2.0
Repository: spark Updated Branches: refs/heads/master 4f7f1c436 -> 07367533d [SPARK-16025][CORE] Document OFF_HEAP storage level in 2.0 This has changed from 1.6, and now stores memory off-heap using spark's off-heap support instead of in tachyon. Author: Eric Liang Closes #13744 from ericl/spark-16025. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07367533 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07367533 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07367533 Branch: refs/heads/master Commit: 07367533de68817e1e6cf9cf2b056a04dd160c8a Parents: 4f7f1c4 Author: Eric Liang Authored: Mon Jun 20 21:56:44 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:56:44 2016 -0700 -- docs/programming-guide.md | 5 + 1 file changed, 5 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/07367533/docs/programming-guide.md -- diff --git a/docs/programming-guide.md b/docs/programming-guide.md index 97bcb51..3872aec 100644 --- a/docs/programming-guide.md +++ b/docs/programming-guide.md @@ -1220,6 +1220,11 @@ storage levels is: MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. Same as the levels above, but replicate each partition on two cluster nodes. + + OFF_HEAP (experimental) + Similar to MEMORY_ONLY_SER, but store the data in +off-heap memory. This requires off-heap memory to be enabled. + **Note:** *In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16044][SQL] input_file_name() returns empty strings in data sources based on NewHadoopRDD
Repository: spark Updated Branches: refs/heads/branch-2.0 9d513b8d2 -> 12f00b6ed [SPARK-16044][SQL] input_file_name() returns empty strings in data sources based on NewHadoopRDD ## What changes were proposed in this pull request? This PR makes `input_file_name()` function return the file paths not empty strings for external data sources based on `NewHadoopRDD`, such as [spark-redshift](https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149) and [spark-xml](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47). The codes with the external data sources below: ```scala df.select(input_file_name).show() ``` will produce - **Before** ``` +-+ |input_file_name()| +-+ | | +-+ ``` - **After** ``` ++ | input_file_name()| ++ |file:/private/var...| ++ ``` ## How was this patch tested? Unit tests in `ColumnExpressionSuite`. Author: hyukjinkwon Closes #13759 from HyukjinKwon/SPARK-16044. (cherry picked from commit 4f7f1c436205630ab77d3758d7210cc1a2f0d04a) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/12f00b6e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/12f00b6e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/12f00b6e Branch: refs/heads/branch-2.0 Commit: 12f00b6edde9b6f97d2450e2cd99edd5e31b9169 Parents: 9d513b8 Author: hyukjinkwon Authored: Mon Jun 20 21:55:34 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:55:40 2016 -0700 -- .../apache/spark/rdd/InputFileNameHolder.scala | 2 +- .../org/apache/spark/rdd/NewHadoopRDD.scala | 7 .../spark/sql/ColumnExpressionSuite.scala | 34 ++-- 3 files changed, 40 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/12f00b6e/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala b/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala index 108e9d2..f40d4c8 100644 --- a/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala +++ b/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala @@ -21,7 +21,7 @@ import org.apache.spark.unsafe.types.UTF8String /** * This holds file names of the current Spark task. This is used in HadoopRDD, - * FileScanRDD and InputFileName function in Spark SQL. + * FileScanRDD, NewHadoopRDD and InputFileName function in Spark SQL. */ private[spark] object InputFileNameHolder { /** http://git-wip-us.apache.org/repos/asf/spark/blob/12f00b6e/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala b/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala index 189dc7b..b086baa 100644 --- a/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala @@ -135,6 +135,12 @@ class NewHadoopRDD[K, V]( val inputMetrics = context.taskMetrics().inputMetrics val existingBytesRead = inputMetrics.bytesRead + // Sets the thread local variable for the file's name + split.serializableHadoopSplit.value match { +case fs: FileSplit => InputFileNameHolder.setInputFileName(fs.getPath.toString) +case _ => InputFileNameHolder.unsetInputFileName() + } + // Find a function that will return the FileSystem bytes read by this thread. 
Do this before // creating RecordReader, because RecordReader's constructor might read some bytes val getBytesReadCallback: Option[() => Long] = split.serializableHadoopSplit.value match { @@ -201,6 +207,7 @@ class NewHadoopRDD[K, V]( private def close() { if (reader != null) { + InputFileNameHolder.unsetInputFileName() // Close the reader and release it. Note: it's very important that we don't close the // reader more than once, since that exposes us to MAPREDUCE-5918 when running against // Hadoop 1.x and older Hadoop 2.x releases. That bug can lead to non-deterministic http://git-wip-us.apache.org/repos/asf/spark/blob/12f00b6e/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.
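For context, a minimal Scala sketch of the function this patch fixes. The built-in JSON source used below already reported file names; the change above extends the same behaviour to sources built on NewHadoopRDD, such as spark-xml and spark-redshift. The path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder().appName("input-file-name-sketch").getOrCreate()

// Any file-based source works; the path is illustrative.
val df = spark.read.json("examples/src/main/resources/people.json")

// Each row reports the file it was read from instead of an empty string.
df.select(input_file_name()).show(false)
```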
spark git commit: [SPARK-16044][SQL] input_file_name() returns empty strings in data sources based on NewHadoopRDD
Repository: spark Updated Branches: refs/heads/master 18a8a9b1f -> 4f7f1c436 [SPARK-16044][SQL] input_file_name() returns empty strings in data sources based on NewHadoopRDD ## What changes were proposed in this pull request? This PR makes `input_file_name()` function return the file paths not empty strings for external data sources based on `NewHadoopRDD`, such as [spark-redshift](https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149) and [spark-xml](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47). The codes with the external data sources below: ```scala df.select(input_file_name).show() ``` will produce - **Before** ``` +-+ |input_file_name()| +-+ | | +-+ ``` - **After** ``` ++ | input_file_name()| ++ |file:/private/var...| ++ ``` ## How was this patch tested? Unit tests in `ColumnExpressionSuite`. Author: hyukjinkwon Closes #13759 from HyukjinKwon/SPARK-16044. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4f7f1c43 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4f7f1c43 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4f7f1c43 Branch: refs/heads/master Commit: 4f7f1c436205630ab77d3758d7210cc1a2f0d04a Parents: 18a8a9b Author: hyukjinkwon Authored: Mon Jun 20 21:55:34 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:55:34 2016 -0700 -- .../apache/spark/rdd/InputFileNameHolder.scala | 2 +- .../org/apache/spark/rdd/NewHadoopRDD.scala | 7 .../spark/sql/ColumnExpressionSuite.scala | 34 ++-- 3 files changed, 40 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4f7f1c43/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala b/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala index 108e9d2..f40d4c8 100644 --- a/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala +++ b/core/src/main/scala/org/apache/spark/rdd/InputFileNameHolder.scala @@ -21,7 +21,7 @@ import org.apache.spark.unsafe.types.UTF8String /** * This holds file names of the current Spark task. This is used in HadoopRDD, - * FileScanRDD and InputFileName function in Spark SQL. + * FileScanRDD, NewHadoopRDD and InputFileName function in Spark SQL. */ private[spark] object InputFileNameHolder { /** http://git-wip-us.apache.org/repos/asf/spark/blob/4f7f1c43/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala b/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala index 189dc7b..b086baa 100644 --- a/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala @@ -135,6 +135,12 @@ class NewHadoopRDD[K, V]( val inputMetrics = context.taskMetrics().inputMetrics val existingBytesRead = inputMetrics.bytesRead + // Sets the thread local variable for the file's name + split.serializableHadoopSplit.value match { +case fs: FileSplit => InputFileNameHolder.setInputFileName(fs.getPath.toString) +case _ => InputFileNameHolder.unsetInputFileName() + } + // Find a function that will return the FileSystem bytes read by this thread. 
Do this before // creating RecordReader, because RecordReader's constructor might read some bytes val getBytesReadCallback: Option[() => Long] = split.serializableHadoopSplit.value match { @@ -201,6 +207,7 @@ class NewHadoopRDD[K, V]( private def close() { if (reader != null) { + InputFileNameHolder.unsetInputFileName() // Close the reader and release it. Note: it's very important that we don't close the // reader more than once, since that exposes us to MAPREDUCE-5918 when running against // Hadoop 1.x and older Hadoop 2.x releases. That bug can lead to non-deterministic http://git-wip-us.apache.org/repos/asf/spark/blob/4f7f1c43/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala index e89fa32..a66c83d 1
spark git commit: [SPARK-16074][MLLIB] expose VectorUDT/MatrixUDT in a public API
Repository: spark Updated Branches: refs/heads/master d9a3a2a0b -> 18a8a9b1f [SPARK-16074][MLLIB] expose VectorUDT/MatrixUDT in a public API ## What changes were proposed in this pull request? Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simply the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection. ## How was this patch tested? Unit tests in Scala and Java. Author: Xiangrui Meng Closes #13789 from mengxr/SPARK-16074. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/18a8a9b1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/18a8a9b1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/18a8a9b1 Branch: refs/heads/master Commit: 18a8a9b1f4114211cd108efda5672f2bd2c6e5cd Parents: d9a3a2a Author: Xiangrui Meng Authored: Mon Jun 20 21:51:02 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:51:02 2016 -0700 -- .../org/apache/spark/ml/linalg/dataTypes.scala | 35 .../spark/ml/linalg/JavaSQLDataTypesSuite.java | 31 + .../spark/ml/linalg/SQLDataTypesSuite.scala | 27 +++ 3 files changed, 93 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/18a8a9b1/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala b/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala new file mode 100644 index 000..52a6fd2 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.linalg + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.sql.types.DataType + +/** + * :: DeveloperApi :: + * SQL data types for vectors and matrices. + */ +@DeveloperApi +object sqlDataTypes { + + /** Data type for [[Vector]]. */ + val VectorType: DataType = new VectorUDT + + /** Data type for [[Matrix]]. */ + val MatrixType: DataType = new MatrixUDT +} http://git-wip-us.apache.org/repos/asf/spark/blob/18a8a9b1/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java -- diff --git a/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java b/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java new file mode 100644 index 000..b09e131 --- /dev/null +++ b/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java @@ -0,0 +1,31 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.linalg; + +import org.junit.Assert; +import org.junit.Test; + +import static org.apache.spark.ml.linalg.sqlDataTypes.*; + +public class JavaSQLDataTypesSuite { + @Test + public void testSQLDataTypes() { +Assert.assertEquals(new VectorUDT(), VectorType()); +Assert.assertEquals(new MatrixUDT(), MatrixType()); + } +} http://git-wip-us
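A hedged sketch of how a third-party transformer might use the new public types instead of reaching for the private UDTs via reflection; the helper and column names below are made up for illustration:

```scala
import org.apache.spark.ml.linalg.sqlDataTypes.VectorType
import org.apache.spark.sql.types.{StructField, StructType}

// Validate that the input column carries ML vectors, as a custom transformer's
// transformSchema typically must, then append a vector-typed output column.
def validateAndTransformSchema(
    schema: StructType,
    inputCol: String,
    outputCol: String): StructType = {
  val inputType = schema(inputCol).dataType
  require(inputType == VectorType,
    s"Column $inputCol must be of vector type but was $inputType.")
  StructType(schema.fields :+ StructField(outputCol, VectorType, nullable = false))
}
```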
spark git commit: [SPARK-16074][MLLIB] expose VectorUDT/MatrixUDT in a public API
Repository: spark Updated Branches: refs/heads/branch-2.0 b998c33c0 -> 9d513b8d2 [SPARK-16074][MLLIB] expose VectorUDT/MatrixUDT in a public API ## What changes were proposed in this pull request? Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simply the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection. ## How was this patch tested? Unit tests in Scala and Java. Author: Xiangrui Meng Closes #13789 from mengxr/SPARK-16074. (cherry picked from commit 18a8a9b1f4114211cd108efda5672f2bd2c6e5cd) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9d513b8d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9d513b8d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9d513b8d Branch: refs/heads/branch-2.0 Commit: 9d513b8d220657cd6c4ab6f182f446b4107d Parents: b998c33 Author: Xiangrui Meng Authored: Mon Jun 20 21:51:02 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:51:06 2016 -0700 -- .../org/apache/spark/ml/linalg/dataTypes.scala | 35 .../spark/ml/linalg/JavaSQLDataTypesSuite.java | 31 + .../spark/ml/linalg/SQLDataTypesSuite.scala | 27 +++ 3 files changed, 93 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9d513b8d/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala b/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala new file mode 100644 index 000..52a6fd2 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.linalg + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.sql.types.DataType + +/** + * :: DeveloperApi :: + * SQL data types for vectors and matrices. + */ +@DeveloperApi +object sqlDataTypes { + + /** Data type for [[Vector]]. */ + val VectorType: DataType = new VectorUDT + + /** Data type for [[Matrix]]. 
*/ + val MatrixType: DataType = new MatrixUDT +} http://git-wip-us.apache.org/repos/asf/spark/blob/9d513b8d/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java -- diff --git a/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java b/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java new file mode 100644 index 000..b09e131 --- /dev/null +++ b/mllib/src/test/java/org/apache/spark/ml/linalg/JavaSQLDataTypesSuite.java @@ -0,0 +1,31 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.linalg; + +import org.junit.Assert; +import org.junit.Test; + +import static org.apache.spark.ml.linalg.sqlDataTypes.*; + +public class JavaSQLDataTypesSuite { + @Test + public void testSQLDataTypes() { +Assert.assertEquals(new Vecto
spark git commit: [SPARK-16056][SPARK-16057][SPARK-16058][SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source
Repository: spark Updated Branches: refs/heads/branch-2.0 603424c16 -> b998c33c0 [SPARK-16056][SPARK-16057][SPARK-16058][SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source What changes were proposed in this pull request? This PR is to fix the following bugs: **Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning** ```scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 4, upperBound = 0, numPartitions = 3, connectionProperties = new Properties) ``` **Before code changes:** The returned results are wrong and the generated partitions are wrong: ``` Part 0 id < 3 or id is null Part 1 id >= 3 AND id < 2 Part 2 id >= 2 ``` **After code changes:** Issue an `IllegalArgumentException` exception: ``` Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1 ``` **Issue 2: numPartitions is more than the number of key values between upper and lower bounds** ```scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 1, upperBound = 5, numPartitions = 10, connectionProperties = new Properties) ``` **Before code changes:** Returned correct results but the generated partitions are very inefficient, like: ``` Partition 0: id < 1 or id is null Partition 1: id >= 1 AND id < 1 Partition 2: id >= 1 AND id < 1 Partition 3: id >= 1 AND id < 1 Partition 4: id >= 1 AND id < 1 Partition 5: id >= 1 AND id < 1 Partition 6: id >= 1 AND id < 1 Partition 7: id >= 1 AND id < 1 Partition 8: id >= 1 AND id < 1 Partition 9: id >= 1 ``` **After code changes:** Adjust `numPartitions` and can return the correct answers: ``` Partition 0: id < 2 or id is null Partition 1: id >= 2 AND id < 3 Partition 2: id >= 3 AND id < 4 Partition 3: id >= 4 ``` **Issue 3: java.lang.ArithmeticException when numPartitions is zero** ```Scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 0, upperBound = 4, numPartitions = 0, connectionProperties = new Properties) ``` **Before code changes:** Got the following exception: ``` java.lang.ArithmeticException: / by zero ``` **After code changes:** Able to return a correct answer by disabling column partitioning when numPartitions is equal to or less than zero How was this patch tested? Added test cases to verify the results Author: gatorsmile Closes #13773 from gatorsmile/jdbcPartitioning. 
(cherry picked from commit d9a3a2a0bec504d17d3b94104d449ee3bd850120) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b998c33c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b998c33c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b998c33c Branch: refs/heads/branch-2.0 Commit: b998c33c0d38f8f724d8846bc8e919ec8b92012e Parents: 603424c Author: gatorsmile Authored: Mon Jun 20 21:49:33 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:49:39 2016 -0700 -- .../datasources/jdbc/JDBCRelation.scala | 48 ++- .../org/apache/spark/sql/jdbc/JDBCSuite.scala | 65 2 files changed, 98 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b998c33c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala index 233b789..11613dd 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala @@ -21,6 +21,7 @@ import java.util.Properties import scala.collection.mutable.ArrayBuffer +import org.apache.spark.internal.Logging import org.apache.spark.Partition import org.apache.spark.rdd.RDD import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession, SQLContext} @@ -36,7 +37,7 @@ private[sql] case class JDBCPartitioningInfo( upperBound: Long, numPartitions: Int) -private[sql] object JDBCRelation { +private[sql] object JDBCRelation extends Logging { /** * Given a partitioning schematic (a column of integral type, a number of * partitions, and upper and lower bounds on the column's value), generate @@ -52,29 +53,46 @@ private[sql] object JDBCRelation { * @return an array of partitions with where clause for each partition */ def columnPartition(partitioning: JDBCPartitioningInfo): Array[Partition] = { -if (parti
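For reference, a well-formed partitioned read after these fixes might look like the sketch below; the JDBC URL, table, and credentials are placeholders. With the fix, a numPartitions larger than the key range is clamped, and a lowerBound greater than the upperBound fails fast with an error instead of returning wrong results:

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partitioning-sketch").getOrCreate()

val props = new Properties()
props.setProperty("user", "username")      // placeholder credentials
props.setProperty("password", "password")

val df = spark.read.jdbc(
  url = "jdbc:postgresql://host:5432/db",  // placeholder URL
  table = "test.seq",
  columnName = "id",
  lowerBound = 0L,
  upperBound = 100L,
  numPartitions = 4,
  connectionProperties = props)
```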
spark git commit: [SPARK-16056][SPARK-16057][SPARK-16058][SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source
Repository: spark Updated Branches: refs/heads/master c775bf09e -> d9a3a2a0b [SPARK-16056][SPARK-16057][SPARK-16058][SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source What changes were proposed in this pull request? This PR is to fix the following bugs: **Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning** ```scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 4, upperBound = 0, numPartitions = 3, connectionProperties = new Properties) ``` **Before code changes:** The returned results are wrong and the generated partitions are wrong: ``` Part 0 id < 3 or id is null Part 1 id >= 3 AND id < 2 Part 2 id >= 2 ``` **After code changes:** Issue an `IllegalArgumentException` exception: ``` Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1 ``` **Issue 2: numPartitions is more than the number of key values between upper and lower bounds** ```scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 1, upperBound = 5, numPartitions = 10, connectionProperties = new Properties) ``` **Before code changes:** Returned correct results but the generated partitions are very inefficient, like: ``` Partition 0: id < 1 or id is null Partition 1: id >= 1 AND id < 1 Partition 2: id >= 1 AND id < 1 Partition 3: id >= 1 AND id < 1 Partition 4: id >= 1 AND id < 1 Partition 5: id >= 1 AND id < 1 Partition 6: id >= 1 AND id < 1 Partition 7: id >= 1 AND id < 1 Partition 8: id >= 1 AND id < 1 Partition 9: id >= 1 ``` **After code changes:** Adjust `numPartitions` and can return the correct answers: ``` Partition 0: id < 2 or id is null Partition 1: id >= 2 AND id < 3 Partition 2: id >= 3 AND id < 4 Partition 3: id >= 4 ``` **Issue 3: java.lang.ArithmeticException when numPartitions is zero** ```Scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 0, upperBound = 4, numPartitions = 0, connectionProperties = new Properties) ``` **Before code changes:** Got the following exception: ``` java.lang.ArithmeticException: / by zero ``` **After code changes:** Able to return a correct answer by disabling column partitioning when numPartitions is equal to or less than zero How was this patch tested? Added test cases to verify the results Author: gatorsmile Closes #13773 from gatorsmile/jdbcPartitioning. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d9a3a2a0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d9a3a2a0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d9a3a2a0 Branch: refs/heads/master Commit: d9a3a2a0bec504d17d3b94104d449ee3bd850120 Parents: c775bf0 Author: gatorsmile Authored: Mon Jun 20 21:49:33 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:49:33 2016 -0700 -- .../datasources/jdbc/JDBCRelation.scala | 48 ++- .../org/apache/spark/sql/jdbc/JDBCSuite.scala | 65 2 files changed, 98 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d9a3a2a0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala index 233b789..11613dd 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala @@ -21,6 +21,7 @@ import java.util.Properties import scala.collection.mutable.ArrayBuffer +import org.apache.spark.internal.Logging import org.apache.spark.Partition import org.apache.spark.rdd.RDD import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession, SQLContext} @@ -36,7 +37,7 @@ private[sql] case class JDBCPartitioningInfo( upperBound: Long, numPartitions: Int) -private[sql] object JDBCRelation { +private[sql] object JDBCRelation extends Logging { /** * Given a partitioning schematic (a column of integral type, a number of * partitions, and upper and lower bounds on the column's value), generate @@ -52,29 +53,46 @@ private[sql] object JDBCRelation { * @return an array of partitions with where clause for each partition */ def columnPartition(partitioning: JDBCPartitioningInfo): Array[Partition] = { -if (partitioning == null) return Array[Partition](JDBCPartition(null, 0)) +if (partitioning == null || partitio
spark git commit: [SPARK-13792][SQL] Limit logging of bad records in CSV data source
Repository: spark Updated Branches: refs/heads/branch-2.0 10c476fc8 -> 603424c16 [SPARK-13792][SQL] Limit logging of bad records in CSV data source ## What changes were proposed in this pull request? This pull request adds a new option (maxMalformedLogPerPartition) in CSV reader to limit the maximum of logging message Spark generates per partition for malformed records. The error log looks something like ``` 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged. ``` Closes #12173 ## How was this patch tested? Manually tested. Author: Reynold Xin Closes #13795 from rxin/SPARK-13792. (cherry picked from commit c775bf09e0c3540f76de3f15d3fd35112a4912c1) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/603424c1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/603424c1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/603424c1 Branch: refs/heads/branch-2.0 Commit: 603424c161e9be670ee8461053225364cc700515 Parents: 10c476f Author: Reynold Xin Authored: Mon Jun 20 21:46:12 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:46:20 2016 -0700 -- python/pyspark/sql/readwriter.py| 4 ++ .../org/apache/spark/sql/DataFrameReader.scala | 2 + .../datasources/csv/CSVFileFormat.scala | 9 - .../execution/datasources/csv/CSVOptions.scala | 2 + .../execution/datasources/csv/CSVRelation.scala | 42 +--- 5 files changed, 44 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/603424c1/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 72fd184..89506ca 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -392,6 +392,10 @@ class DataFrameReader(ReaderUtils): :param maxCharsPerColumn: defines the maximum number of characters allowed for any given value being read. If None is set, it uses the default value, ``100``. +:param maxMalformedLogPerPartition: sets the maximum number of malformed rows Spark will +log for each partition. Malformed records beyond this +number will be ignored. If None is set, it +uses the default value, ``10``. :param mode: allows a mode for dealing with corrupt records during parsing. If None is set, it uses the default value, ``PERMISSIVE``. 
http://git-wip-us.apache.org/repos/asf/spark/blob/603424c1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index 841503b..35ba9c5 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -382,6 +382,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * a record can have. * `maxCharsPerColumn` (default `100`): defines the maximum number of characters allowed * for any given value being read. + * `maxMalformedLogPerPartition` (default `10`): sets the maximum number of malformed rows + * Spark will log for each partition. Malformed records beyond this number will be ignored. * `mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records *during parsing. * http://git-wip-us.apache.org/repos/asf/spark/blob/603424c1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala -- diff --git
spark git commit: [SPARK-13792][SQL] Limit logging of bad records in CSV data source
Repository: spark Updated Branches: refs/heads/master 217db56ba -> c775bf09e [SPARK-13792][SQL] Limit logging of bad records in CSV data source ## What changes were proposed in this pull request? This pull request adds a new option (maxMalformedLogPerPartition) in CSV reader to limit the maximum of logging message Spark generates per partition for malformed records. The error log looks something like ``` 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged. ``` Closes #12173 ## How was this patch tested? Manually tested. Author: Reynold Xin Closes #13795 from rxin/SPARK-13792. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c775bf09 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c775bf09 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c775bf09 Branch: refs/heads/master Commit: c775bf09e0c3540f76de3f15d3fd35112a4912c1 Parents: 217db56 Author: Reynold Xin Authored: Mon Jun 20 21:46:12 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 21:46:12 2016 -0700 -- python/pyspark/sql/readwriter.py| 4 ++ .../org/apache/spark/sql/DataFrameReader.scala | 2 + .../datasources/csv/CSVFileFormat.scala | 9 - .../execution/datasources/csv/CSVOptions.scala | 2 + .../execution/datasources/csv/CSVRelation.scala | 42 +--- 5 files changed, 44 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c775bf09/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 72fd184..89506ca 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -392,6 +392,10 @@ class DataFrameReader(ReaderUtils): :param maxCharsPerColumn: defines the maximum number of characters allowed for any given value being read. If None is set, it uses the default value, ``100``. +:param maxMalformedLogPerPartition: sets the maximum number of malformed rows Spark will +log for each partition. Malformed records beyond this +number will be ignored. If None is set, it +uses the default value, ``10``. :param mode: allows a mode for dealing with corrupt records during parsing. If None is set, it uses the default value, ``PERMISSIVE``. 
http://git-wip-us.apache.org/repos/asf/spark/blob/c775bf09/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index 841503b..35ba9c5 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -382,6 +382,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * a record can have. * `maxCharsPerColumn` (default `100`): defines the maximum number of characters allowed * for any given value being read. + * `maxMalformedLogPerPartition` (default `10`): sets the maximum number of malformed rows + * Spark will log for each partition. Malformed records beyond this number will be ignored. * `mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records *during parsing. * http://git-wip-us.apache.org/repos/asf/spark/blob/c775bf09/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala b/sql/core/
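As a usage note for the option added in SPARK-13792, a minimal PySpark sketch follows. The input path and data are hypothetical; the same `maxMalformedLogPerPartition` key is the one documented for the Scala `DataFrameReader` above.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-malformed-demo").getOrCreate()

# Hypothetical input file that contains some malformed rows.
df = (spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")             # drop bad rows instead of failing
      .option("maxMalformedLogPerPartition", "5")  # warn about at most 5 bad rows per partition
      .csv("/tmp/people.csv"))

df.show()
```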
spark git commit: [SPARK-15294][R] Add `pivot` to SparkR
Repository: spark Updated Branches: refs/heads/branch-2.0 087bd2799 -> 10c476fc8 [SPARK-15294][R] Add `pivot` to SparkR ## What changes were proposed in this pull request? This PR adds `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295 , mhnatiuk should be credited for the work he did. ## How was this patch tested? Pass the Jenkins tests (including new testcase.) Author: Dongjoon Hyun Closes #13786 from dongjoon-hyun/SPARK-15294. (cherry picked from commit 217db56ba11fcdf9e3a81946667d1d99ad7344ee) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/10c476fc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/10c476fc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/10c476fc Branch: refs/heads/branch-2.0 Commit: 10c476fc8f4780e487d8ada626f6924866f5711f Parents: 087bd27 Author: Dongjoon Hyun Authored: Mon Jun 20 21:09:39 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 21:09:51 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/generics.R| 4 +++ R/pkg/R/group.R | 43 ++ R/pkg/inst/tests/testthat/test_sparkSQL.R | 25 +++ 4 files changed, 73 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/10c476fc/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 45663f4..ea42888 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -294,6 +294,7 @@ exportMethods("%in%", exportClasses("GroupedData") exportMethods("agg") +exportMethods("pivot") export("as.DataFrame", "cacheTable", http://git-wip-us.apache.org/repos/asf/spark/blob/10c476fc/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 3fb6370..c307de7 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -160,6 +160,10 @@ setGeneric("persist", function(x, newLevel) { standardGeneric("persist") }) # @export setGeneric("pipeRDD", function(x, command, env = list()) { standardGeneric("pipeRDD")}) +# @rdname pivot +# @export +setGeneric("pivot", function(x, colname, values = list()) { standardGeneric("pivot") }) + # @rdname reduce # @export setGeneric("reduce", function(x, func) { standardGeneric("reduce") }) http://git-wip-us.apache.org/repos/asf/spark/blob/10c476fc/R/pkg/R/group.R -- diff --git a/R/pkg/R/group.R b/R/pkg/R/group.R index 51e1516..0687f14 100644 --- a/R/pkg/R/group.R +++ b/R/pkg/R/group.R @@ -134,6 +134,49 @@ methods <- c("avg", "max", "mean", "min", "sum") # These are not exposed on GroupedData: "kurtosis", "skewness", "stddev", "stddev_samp", "stddev_pop", # "variance", "var_samp", "var_pop" +#' Pivot a column of the GroupedData and perform the specified aggregation. +#' +#' Pivot a column of the GroupedData and perform the specified aggregation. +#' There are two versions of pivot function: one that requires the caller to specify the list +#' of distinct values to pivot on, and one that does not. The latter is more concise but less +#' efficient, because Spark needs to first compute the list of distinct values internally. +#' +#' @param x a GroupedData object +#' @param colname A column name +#' @param values A value or a list/vector of distinct values for the output columns. 
+#' @return GroupedData object +#' @rdname pivot +#' @name pivot +#' @export +#' @examples +#' \dontrun{ +#' df <- createDataFrame(data.frame( +#' earnings = c(1, 1, 11000, 15000, 12000, 2, 21000, 22000), +#' course = c("R", "Python", "R", "Python", "R", "Python", "R", "Python"), +#' period = c("1H", "1H", "2H", "2H", "1H", "1H", "2H", "2H"), +#' year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016) +#' )) +#' group_sum <- sum(pivot(groupBy(df, "year"), "course"), "earnings") +#' group_min <- min(pivot(groupBy(df, "year"), "course", "R"), "earnings") +#' group_max <- max(pivot(groupBy(df, "year"), "course", c("Python", "R")), "earnings") +#' group_mean <- mean(pivot(groupBy(df, "year"), "course", list("Python", "R")), "earnings") +#' } +#' @note pivot since 2.0.0 +setMethod("pivot", + signature(x = "GroupedData", colname = "character"), + function(x, colname, values = list()){ +stopifnot(length(colname) == 1) +if (length(values) == 0) { + result <- callJMethod(x@sgd, "pivot", colname) +} else { + if (length(values) > length(unique(values))) { +stop("Values are not unique") + } +
spark git commit: [SPARK-15294][R] Add `pivot` to SparkR
Repository: spark Updated Branches: refs/heads/master a46553cba -> 217db56ba [SPARK-15294][R] Add `pivot` to SparkR ## What changes were proposed in this pull request? This PR adds `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295 , mhnatiuk should be credited for the work he did. ## How was this patch tested? Pass the Jenkins tests (including new testcase.) Author: Dongjoon Hyun Closes #13786 from dongjoon-hyun/SPARK-15294. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/217db56b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/217db56b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/217db56b Branch: refs/heads/master Commit: 217db56ba11fcdf9e3a81946667d1d99ad7344ee Parents: a46553c Author: Dongjoon Hyun Authored: Mon Jun 20 21:09:39 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 21:09:39 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/generics.R| 4 +++ R/pkg/R/group.R | 43 ++ R/pkg/inst/tests/testthat/test_sparkSQL.R | 25 +++ 4 files changed, 73 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/217db56b/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 45663f4..ea42888 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -294,6 +294,7 @@ exportMethods("%in%", exportClasses("GroupedData") exportMethods("agg") +exportMethods("pivot") export("as.DataFrame", "cacheTable", http://git-wip-us.apache.org/repos/asf/spark/blob/217db56b/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 3fb6370..c307de7 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -160,6 +160,10 @@ setGeneric("persist", function(x, newLevel) { standardGeneric("persist") }) # @export setGeneric("pipeRDD", function(x, command, env = list()) { standardGeneric("pipeRDD")}) +# @rdname pivot +# @export +setGeneric("pivot", function(x, colname, values = list()) { standardGeneric("pivot") }) + # @rdname reduce # @export setGeneric("reduce", function(x, func) { standardGeneric("reduce") }) http://git-wip-us.apache.org/repos/asf/spark/blob/217db56b/R/pkg/R/group.R -- diff --git a/R/pkg/R/group.R b/R/pkg/R/group.R index 51e1516..0687f14 100644 --- a/R/pkg/R/group.R +++ b/R/pkg/R/group.R @@ -134,6 +134,49 @@ methods <- c("avg", "max", "mean", "min", "sum") # These are not exposed on GroupedData: "kurtosis", "skewness", "stddev", "stddev_samp", "stddev_pop", # "variance", "var_samp", "var_pop" +#' Pivot a column of the GroupedData and perform the specified aggregation. +#' +#' Pivot a column of the GroupedData and perform the specified aggregation. +#' There are two versions of pivot function: one that requires the caller to specify the list +#' of distinct values to pivot on, and one that does not. The latter is more concise but less +#' efficient, because Spark needs to first compute the list of distinct values internally. +#' +#' @param x a GroupedData object +#' @param colname A column name +#' @param values A value or a list/vector of distinct values for the output columns. 
+#' @return GroupedData object +#' @rdname pivot +#' @name pivot +#' @export +#' @examples +#' \dontrun{ +#' df <- createDataFrame(data.frame( +#' earnings = c(1, 1, 11000, 15000, 12000, 2, 21000, 22000), +#' course = c("R", "Python", "R", "Python", "R", "Python", "R", "Python"), +#' period = c("1H", "1H", "2H", "2H", "1H", "1H", "2H", "2H"), +#' year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016) +#' )) +#' group_sum <- sum(pivot(groupBy(df, "year"), "course"), "earnings") +#' group_min <- min(pivot(groupBy(df, "year"), "course", "R"), "earnings") +#' group_max <- max(pivot(groupBy(df, "year"), "course", c("Python", "R")), "earnings") +#' group_mean <- mean(pivot(groupBy(df, "year"), "course", list("Python", "R")), "earnings") +#' } +#' @note pivot since 2.0.0 +setMethod("pivot", + signature(x = "GroupedData", colname = "character"), + function(x, colname, values = list()){ +stopifnot(length(colname) == 1) +if (length(values) == 0) { + result <- callJMethod(x@sgd, "pivot", colname) +} else { + if (length(values) > length(unique(values))) { +stop("Values are not unique") + } + result <- callJMethod(x@sgd, "pivot", colname, as.list(values)) +} +groupedData(resu
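The new SparkR method delegates to the same JVM `pivot` that Scala and Python already expose. For comparison, a rough PySpark counterpart of the roxygen example above (illustrative only):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pivot-demo").getOrCreate()

df = spark.createDataFrame(
    [(10000, "R", "1H", 2015), (12000, "Python", "1H", 2015),
     (11000, "R", "2H", 2015), (15000, "Python", "2H", 2015),
     (21000, "R", "1H", 2016), (22000, "Python", "1H", 2016)],
    ["earnings", "course", "period", "year"])

# Passing the distinct values up front avoids an extra pass to compute them.
df.groupBy("year").pivot("course", ["Python", "R"]).sum("earnings").show()
```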
spark git commit: [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6)
Repository: spark Updated Branches: refs/heads/master e2b7eba87 -> a46553cba [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6) Fix the bug for Python UDF that does not have any arguments. Added regression tests. Author: Davies Liu Closes #13793 from davies/fix_no_arguments. (cherry picked from commit abe36c53d126bb580e408a45245fd8e81806869c) Signed-off-by: Davies Liu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a46553cb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a46553cb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a46553cb Branch: refs/heads/master Commit: a46553cbacf0e4012df89fe55385dec5beaa680a Parents: e2b7eba Author: Davies Liu Authored: Mon Jun 20 20:50:30 2016 -0700 Committer: Davies Liu Committed: Mon Jun 20 20:53:45 2016 -0700 -- python/pyspark/sql/tests.py | 5 + python/pyspark/sql/types.py | 9 +++-- 2 files changed, 8 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a46553cb/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index c631ad8..ecd1a05 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -318,6 +318,11 @@ class SQLTests(ReusedPySparkTestCase): [row] = self.spark.sql("SELECT double(add(1, 2)), add(double(2), 1)").collect() self.assertEqual(tuple(row), (6, 5)) +def test_udf_without_arguments(self): +self.sqlCtx.registerFunction("foo", lambda: "bar") +[row] = self.sqlCtx.sql("SELECT foo()").collect() +self.assertEqual(row[0], "bar") + def test_udf_with_array_type(self): d = [Row(l=list(range(3)), d={"key": list(range(5))})] rdd = self.sc.parallelize(d) http://git-wip-us.apache.org/repos/asf/spark/blob/a46553cb/python/pyspark/sql/types.py -- diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py index bb2b954..f0b56be 100644 --- a/python/pyspark/sql/types.py +++ b/python/pyspark/sql/types.py @@ -1401,11 +1401,7 @@ class Row(tuple): if args and kwargs: raise ValueError("Can not use both args " "and kwargs to create Row") -if args: -# create row class or objects -return tuple.__new__(self, args) - -elif kwargs: +if kwargs: # create row objects names = sorted(kwargs.keys()) row = tuple.__new__(self, [kwargs[n] for n in names]) @@ -1413,7 +1409,8 @@ class Row(tuple): return row else: -raise ValueError("No args or kwargs") +# create row class or objects +return tuple.__new__(self, args) def asDict(self, recursive=False): """ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6)
Repository: spark Updated Branches: refs/heads/branch-2.0 f57317690 -> 087bd2799 [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6) Fix the bug for Python UDF that does not have any arguments. Added regression tests. Author: Davies Liu Closes #13793 from davies/fix_no_arguments. (cherry picked from commit abe36c53d126bb580e408a45245fd8e81806869c) Signed-off-by: Davies Liu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/087bd279 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/087bd279 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/087bd279 Branch: refs/heads/branch-2.0 Commit: 087bd2799366f4914d248e9b1f0fb921adbbdb43 Parents: f573176 Author: Davies Liu Authored: Mon Jun 20 20:50:30 2016 -0700 Committer: Davies Liu Committed: Mon Jun 20 20:52:55 2016 -0700 -- python/pyspark/sql/tests.py | 5 + python/pyspark/sql/types.py | 9 +++-- 2 files changed, 8 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/087bd279/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index c631ad8..ecd1a05 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -318,6 +318,11 @@ class SQLTests(ReusedPySparkTestCase): [row] = self.spark.sql("SELECT double(add(1, 2)), add(double(2), 1)").collect() self.assertEqual(tuple(row), (6, 5)) +def test_udf_without_arguments(self): +self.sqlCtx.registerFunction("foo", lambda: "bar") +[row] = self.sqlCtx.sql("SELECT foo()").collect() +self.assertEqual(row[0], "bar") + def test_udf_with_array_type(self): d = [Row(l=list(range(3)), d={"key": list(range(5))})] rdd = self.sc.parallelize(d) http://git-wip-us.apache.org/repos/asf/spark/blob/087bd279/python/pyspark/sql/types.py -- diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py index bb2b954..f0b56be 100644 --- a/python/pyspark/sql/types.py +++ b/python/pyspark/sql/types.py @@ -1401,11 +1401,7 @@ class Row(tuple): if args and kwargs: raise ValueError("Can not use both args " "and kwargs to create Row") -if args: -# create row class or objects -return tuple.__new__(self, args) - -elif kwargs: +if kwargs: # create row objects names = sorted(kwargs.keys()) row = tuple.__new__(self, [kwargs[n] for n in names]) @@ -1413,7 +1409,8 @@ class Row(tuple): return row else: -raise ValueError("No args or kwargs") +# create row class or objects +return tuple.__new__(self, args) def asDict(self, recursive=False): """ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6)
Repository: spark Updated Branches: refs/heads/branch-1.5 1891e04a6 -> 6001138fd [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6) ## What changes were proposed in this pull request? Fix the bug for Python UDF that does not have any arguments. ## How was this patch tested? Added regression tests. Author: Davies Liu Closes #13793 from davies/fix_no_arguments. (cherry picked from commit abe36c53d126bb580e408a45245fd8e81806869c) Signed-off-by: Davies Liu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6001138f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6001138f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6001138f Branch: refs/heads/branch-1.5 Commit: 6001138fd68f2318028519d09563f12874b54e7d Parents: 1891e04 Author: Davies Liu Authored: Mon Jun 20 20:50:30 2016 -0700 Committer: Davies Liu Committed: Mon Jun 20 20:50:57 2016 -0700 -- python/pyspark/sql/tests.py | 5 + python/pyspark/sql/types.py | 9 +++-- 2 files changed, 8 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6001138f/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 27c9d45..86e2dfb 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -286,6 +286,11 @@ class SQLTests(ReusedPySparkTestCase): [res] = self.sqlCtx.sql("SELECT strlen(a) FROM test WHERE strlen(a) > 1").collect() self.assertEqual(4, res[0]) +def test_udf_without_arguments(self): +self.sqlCtx.registerFunction("foo", lambda: "bar") +[row] = self.sqlCtx.sql("SELECT foo()").collect() +self.assertEqual(row[0], "bar") + def test_udf_with_array_type(self): d = [Row(l=list(range(3)), d={"key": list(range(5))})] rdd = self.sc.parallelize(d) http://git-wip-us.apache.org/repos/asf/spark/blob/6001138f/python/pyspark/sql/types.py -- diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py index b0ac207..db4cc42 100644 --- a/python/pyspark/sql/types.py +++ b/python/pyspark/sql/types.py @@ -1193,11 +1193,7 @@ class Row(tuple): if args and kwargs: raise ValueError("Can not use both args " "and kwargs to create Row") -if args: -# create row class or objects -return tuple.__new__(self, args) - -elif kwargs: +if kwargs: # create row objects names = sorted(kwargs.keys()) row = tuple.__new__(self, [kwargs[n] for n in names]) @@ -1205,7 +1201,8 @@ class Row(tuple): return row else: -raise ValueError("No args or kwargs") +# create row class or objects +return tuple.__new__(self, args) def asDict(self, recursive=False): """ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6)
Repository: spark Updated Branches: refs/heads/branch-1.6 db86e7fd2 -> abe36c53d [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6) ## What changes were proposed in this pull request? Fix the bug for Python UDF that does not have any arguments. ## How was this patch tested? Added regression tests. Author: Davies Liu Closes #13793 from davies/fix_no_arguments. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/abe36c53 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/abe36c53 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/abe36c53 Branch: refs/heads/branch-1.6 Commit: abe36c53d126bb580e408a45245fd8e81806869c Parents: db86e7f Author: Davies Liu Authored: Mon Jun 20 20:50:30 2016 -0700 Committer: Davies Liu Committed: Mon Jun 20 20:50:30 2016 -0700 -- python/pyspark/sql/tests.py | 5 + python/pyspark/sql/types.py | 9 +++-- 2 files changed, 8 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/abe36c53/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 0dc4274..43eb6ec 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -305,6 +305,11 @@ class SQLTests(ReusedPySparkTestCase): [res] = self.sqlCtx.sql("SELECT strlen(a) FROM test WHERE strlen(a) > 1").collect() self.assertEqual(4, res[0]) +def test_udf_without_arguments(self): +self.sqlCtx.registerFunction("foo", lambda: "bar") +[row] = self.sqlCtx.sql("SELECT foo()").collect() +self.assertEqual(row[0], "bar") + def test_udf_with_array_type(self): d = [Row(l=list(range(3)), d={"key": list(range(5))})] rdd = self.sc.parallelize(d) http://git-wip-us.apache.org/repos/asf/spark/blob/abe36c53/python/pyspark/sql/types.py -- diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py index 5bc0773..211b01f 100644 --- a/python/pyspark/sql/types.py +++ b/python/pyspark/sql/types.py @@ -1195,11 +1195,7 @@ class Row(tuple): if args and kwargs: raise ValueError("Can not use both args " "and kwargs to create Row") -if args: -# create row class or objects -return tuple.__new__(self, args) - -elif kwargs: +if kwargs: # create row objects names = sorted(kwargs.keys()) row = tuple.__new__(self, [kwargs[n] for n in names]) @@ -1207,7 +1203,8 @@ class Row(tuple): return row else: -raise ValueError("No args or kwargs") +# create row class or objects +return tuple.__new__(self, args) def asDict(self, recursive=False): """ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
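The regression test included in the patch is essentially the whole usage story; spelled out as a standalone sketch (the function name `foo` is simply the one used in the test):

```
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="zero-arg-udf-demo")
sqlContext = SQLContext(sc)

# A Python UDF with no arguments; before this fix, building the result Row
# failed with "No args or kwargs".
sqlContext.registerFunction("foo", lambda: "bar")
[row] = sqlContext.sql("SELECT foo()").collect()
print(row[0])  # 'bar'
```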
spark git commit: remove duplicated docs in dapply
Repository: spark Updated Branches: refs/heads/branch-2.0 c7006538a -> f57317690 remove duplicated docs in dapply ## What changes were proposed in this pull request? Removed unnecessary duplicated documentation in dapply and dapplyCollect. In this pull request I created separate R docs for dapply and dapplyCollect - kept dapply's documentation separate from dapplyCollect's and referred from one to another via a link. ## How was this patch tested? Existing test cases. Author: Narine Kokhlikyan Closes #13790 from NarineK/dapply-docs-fix. (cherry picked from commit e2b7eba87cdf67fa737c32f5f6ca075445ff28cb) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5731769 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5731769 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5731769 Branch: refs/heads/branch-2.0 Commit: f573176902ebff0fd6a2f572c94a2cca3e057b72 Parents: c700653 Author: Narine Kokhlikyan Authored: Mon Jun 20 19:36:51 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 19:36:58 2016 -0700 -- R/pkg/R/DataFrame.R | 4 +++- R/pkg/R/generics.R | 2 +- 2 files changed, 4 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f5731769/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index ecdcd6e..b3f2dd8 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -1250,6 +1250,7 @@ dapplyInternal <- function(x, func, schema) { #' @family SparkDataFrame functions #' @rdname dapply #' @name dapply +#' @seealso \link{dapplyCollect} #' @export #' @examples #' \dontrun{ @@ -1294,8 +1295,9 @@ setMethod("dapply", #' to each partition will be passed. #' The output of func should be a data.frame. #' @family SparkDataFrame functions -#' @rdname dapply +#' @rdname dapplyCollect #' @name dapplyCollect +#' @seealso \link{dapply} #' @export #' @examples #' \dontrun{ http://git-wip-us.apache.org/repos/asf/spark/blob/f5731769/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index f6b9276..3fb6370 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -457,7 +457,7 @@ setGeneric("createOrReplaceTempView", #' @export setGeneric("dapply", function(x, func, schema) { standardGeneric("dapply") }) -#' @rdname dapply +#' @rdname dapplyCollect #' @export setGeneric("dapplyCollect", function(x, func) { standardGeneric("dapplyCollect") }) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: remove duplicated docs in dapply
Repository: spark Updated Branches: refs/heads/master a42bf5553 -> e2b7eba87 remove duplicated docs in dapply ## What changes were proposed in this pull request? Removed unnecessary duplicated documentation in dapply and dapplyCollect. In this pull request I created separate R docs for dapply and dapplyCollect - kept dapply's documentation separate from dapplyCollect's and referred from one to another via a link. ## How was this patch tested? Existing test cases. Author: Narine Kokhlikyan Closes #13790 from NarineK/dapply-docs-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e2b7eba8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e2b7eba8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e2b7eba8 Branch: refs/heads/master Commit: e2b7eba87cdf67fa737c32f5f6ca075445ff28cb Parents: a42bf55 Author: Narine Kokhlikyan Authored: Mon Jun 20 19:36:51 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 19:36:51 2016 -0700 -- R/pkg/R/DataFrame.R | 4 +++- R/pkg/R/generics.R | 2 +- 2 files changed, 4 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e2b7eba8/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index ecdcd6e..b3f2dd8 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -1250,6 +1250,7 @@ dapplyInternal <- function(x, func, schema) { #' @family SparkDataFrame functions #' @rdname dapply #' @name dapply +#' @seealso \link{dapplyCollect} #' @export #' @examples #' \dontrun{ @@ -1294,8 +1295,9 @@ setMethod("dapply", #' to each partition will be passed. #' The output of func should be a data.frame. #' @family SparkDataFrame functions -#' @rdname dapply +#' @rdname dapplyCollect #' @name dapplyCollect +#' @seealso \link{dapply} #' @export #' @examples #' \dontrun{ http://git-wip-us.apache.org/repos/asf/spark/blob/e2b7eba8/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index f6b9276..3fb6370 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -457,7 +457,7 @@ setGeneric("createOrReplaceTempView", #' @export setGeneric("dapply", function(x, func, schema) { standardGeneric("dapply") }) -#' @rdname dapply +#' @rdname dapplyCollect #' @export setGeneric("dapplyCollect", function(x, func) { standardGeneric("dapplyCollect") }) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16079][PYSPARK][ML] Added missing import for DecisionTreeRegressionModel used in GBTClassificationModel
Repository: spark Updated Branches: refs/heads/branch-2.0 b40663541 -> c7006538a [SPARK-16079][PYSPARK][ML] Added missing import for DecisionTreeRegressionModel used in GBTClassificationModel ## What changes were proposed in this pull request? Fixed missing import for DecisionTreeRegressionModel used in GBTClassificationModel trees method. ## How was this patch tested? Local tests Author: Bryan Cutler Closes #13787 from BryanCutler/pyspark-GBTClassificationModel-import-SPARK-16079. (cherry picked from commit a42bf555326b75c8251be77db68105c29e8c95c4) Signed-off-by: Xiangrui Meng Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c7006538 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c7006538 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c7006538 Branch: refs/heads/branch-2.0 Commit: c7006538a88bee85e0292bc9564ae8bfdf734ed6 Parents: b406635 Author: Bryan Cutler Authored: Mon Jun 20 16:28:11 2016 -0700 Committer: Xiangrui Meng Committed: Mon Jun 20 16:28:19 2016 -0700 -- python/pyspark/ml/classification.py | 6 -- python/pyspark/ml/regression.py | 2 ++ 2 files changed, 6 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c7006538/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 121b926..a3cd917 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -21,8 +21,8 @@ import warnings from pyspark import since, keyword_only from pyspark.ml import Estimator, Model from pyspark.ml.param.shared import * -from pyspark.ml.regression import ( -RandomForestParams, TreeEnsembleParams, DecisionTreeModel, TreeEnsembleModels) +from pyspark.ml.regression import DecisionTreeModel, DecisionTreeRegressionModel, \ +RandomForestParams, TreeEnsembleModels, TreeEnsembleParams from pyspark.ml.util import * from pyspark.ml.wrapper import JavaEstimator, JavaModel, JavaParams from pyspark.ml.wrapper import JavaWrapper @@ -798,6 +798,8 @@ class GBTClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol True >>> model.treeWeights == model2.treeWeights True +>>> model.trees +[DecisionTreeRegressionModel (uid=...) of depth..., DecisionTreeRegressionModel...] .. versionadded:: 1.4.0 """ http://git-wip-us.apache.org/repos/asf/spark/blob/c7006538/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index db31993..8d2378d 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -994,6 +994,8 @@ class GBTRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, True >>> model.treeWeights == model2.treeWeights True +>>> model.trees +[DecisionTreeRegressionModel (uid=...) of depth..., DecisionTreeRegressionModel...] .. versionadded:: 1.4.0 """ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16079][PYSPARK][ML] Added missing import for DecisionTreeRegressionModel used in GBTClassificationModel
Repository: spark Updated Branches: refs/heads/master 6daa8cf1a -> a42bf5553 [SPARK-16079][PYSPARK][ML] Added missing import for DecisionTreeRegressionModel used in GBTClassificationModel ## What changes were proposed in this pull request? Fixed missing import for DecisionTreeRegressionModel used in GBTClassificationModel trees method. ## How was this patch tested? Local tests Author: Bryan Cutler Closes #13787 from BryanCutler/pyspark-GBTClassificationModel-import-SPARK-16079. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a42bf555 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a42bf555 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a42bf555 Branch: refs/heads/master Commit: a42bf555326b75c8251be77db68105c29e8c95c4 Parents: 6daa8cf Author: Bryan Cutler Authored: Mon Jun 20 16:28:11 2016 -0700 Committer: Xiangrui Meng Committed: Mon Jun 20 16:28:11 2016 -0700 -- python/pyspark/ml/classification.py | 6 -- python/pyspark/ml/regression.py | 2 ++ 2 files changed, 6 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a42bf555/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 121b926..a3cd917 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -21,8 +21,8 @@ import warnings from pyspark import since, keyword_only from pyspark.ml import Estimator, Model from pyspark.ml.param.shared import * -from pyspark.ml.regression import ( -RandomForestParams, TreeEnsembleParams, DecisionTreeModel, TreeEnsembleModels) +from pyspark.ml.regression import DecisionTreeModel, DecisionTreeRegressionModel, \ +RandomForestParams, TreeEnsembleModels, TreeEnsembleParams from pyspark.ml.util import * from pyspark.ml.wrapper import JavaEstimator, JavaModel, JavaParams from pyspark.ml.wrapper import JavaWrapper @@ -798,6 +798,8 @@ class GBTClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol True >>> model.treeWeights == model2.treeWeights True +>>> model.trees +[DecisionTreeRegressionModel (uid=...) of depth..., DecisionTreeRegressionModel...] .. versionadded:: 1.4.0 """ http://git-wip-us.apache.org/repos/asf/spark/blob/a42bf555/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index db31993..8d2378d 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -994,6 +994,8 @@ class GBTRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, True >>> model.treeWeights == model2.treeWeights True +>>> model.trees +[DecisionTreeRegressionModel (uid=...) of depth..., DecisionTreeRegressionModel...] .. versionadded:: 1.4.0 """ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
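The doctest added above exercises `model.trees`; a self-contained PySpark sketch of the same access pattern, using a toy two-row training set (hypothetical data, chosen only so the fit runs):

```
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gbt-trees-demo").getOrCreate()

train = spark.createDataFrame(
    [(1.0, Vectors.dense(1.0)), (0.0, Vectors.sparse(1, [], []))],
    ["label", "features"])

model = GBTClassifier(maxIter=5, maxDepth=2).fit(train)

# Each element is a DecisionTreeRegressionModel -- the class whose import was missing.
for tree in model.trees:
    print(tree)
print(model.treeWeights)
```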
spark git commit: [SPARK-16061][SQL][MINOR] The property "spark.streaming.stateStore.maintenanceInterval" should be renamed to "spark.sql.streaming.stateStore.maintenanceInterval"
Repository: spark Updated Branches: refs/heads/master b99129cc4 -> 6daa8cf1a [SPARK-16061][SQL][MINOR] The property "spark.streaming.stateStore.maintenanceInterval" should be renamed to "spark.sql.streaming.stateStore.maintenanceInterval" ## What changes were proposed in this pull request? The property spark.streaming.stateStore.maintenanceInterval should be renamed and harmonized with other properties related to Structured Streaming like spark.sql.streaming.stateStore.minDeltasForSnapshot. ## How was this patch tested? Existing unit tests. Author: Kousuke Saruta Closes #13777 from sarutak/SPARK-16061. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6daa8cf1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6daa8cf1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6daa8cf1 Branch: refs/heads/master Commit: 6daa8cf1a642a669cd3a0305036c4390e4336a73 Parents: b99129c Author: Kousuke Saruta Authored: Mon Jun 20 15:12:40 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 15:12:40 2016 -0700 -- .../apache/spark/sql/execution/streaming/state/StateStore.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6daa8cf1/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala index 9948292..0667653 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala @@ -115,7 +115,7 @@ case class KeyRemoved(key: UnsafeRow) extends StoreUpdate */ private[sql] object StateStore extends Logging { - val MAINTENANCE_INTERVAL_CONFIG = "spark.streaming.stateStore.maintenanceInterval" + val MAINTENANCE_INTERVAL_CONFIG = "spark.sql.streaming.stateStore.maintenanceInterval" val MAINTENANCE_INTERVAL_DEFAULT_SECS = 60 private val loadedProviders = new mutable.HashMap[StateStoreId, StateStoreProvider]() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16061][SQL][MINOR] The property "spark.streaming.stateStore.maintenanceInterval" should be renamed to "spark.sql.streaming.stateStore.maintenanceInterval"
Repository: spark Updated Branches: refs/heads/branch-2.0 54001cb12 -> b40663541 [SPARK-16061][SQL][MINOR] The property "spark.streaming.stateStore.maintenanceInterval" should be renamed to "spark.sql.streaming.stateStore.maintenanceInterval" ## What changes were proposed in this pull request? The property spark.streaming.stateStore.maintenanceInterval should be renamed and harmonized with other properties related to Structured Streaming like spark.sql.streaming.stateStore.minDeltasForSnapshot. ## How was this patch tested? Existing unit tests. Author: Kousuke Saruta Closes #13777 from sarutak/SPARK-16061. (cherry picked from commit 6daa8cf1a642a669cd3a0305036c4390e4336a73) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b4066354 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b4066354 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b4066354 Branch: refs/heads/branch-2.0 Commit: b4066354141b933cdfdfdf266c6d4ff21338dcdf Parents: 54001cb Author: Kousuke Saruta Authored: Mon Jun 20 15:12:40 2016 -0700 Committer: Reynold Xin Committed: Mon Jun 20 15:12:45 2016 -0700 -- .../apache/spark/sql/execution/streaming/state/StateStore.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b4066354/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala index 9948292..0667653 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala @@ -115,7 +115,7 @@ case class KeyRemoved(key: UnsafeRow) extends StoreUpdate */ private[sql] object StateStore extends Logging { - val MAINTENANCE_INTERVAL_CONFIG = "spark.streaming.stateStore.maintenanceInterval" + val MAINTENANCE_INTERVAL_CONFIG = "spark.sql.streaming.stateStore.maintenanceInterval" val MAINTENANCE_INTERVAL_DEFAULT_SECS = 60 private val loadedProviders = new mutable.HashMap[StateStoreId, StateStoreProvider]() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
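For users who set this property: it is read as an ordinary Spark configuration time string (default 60s), so after this rename it would be supplied under the new key, e.g. via `--conf` on spark-submit or when building the session. A PySpark sketch under that assumption:

```
from pyspark.sql import SparkSession

# Assumption: the interval is picked up from the SparkConf of the running application.
spark = (SparkSession.builder
         .appName("state-store-demo")
         .config("spark.sql.streaming.stateStore.maintenanceInterval", "30s")
         .getOrCreate())

print(spark.sparkContext.getConf().get(
    "spark.sql.streaming.stateStore.maintenanceInterval"))
```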
spark git commit: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc
Repository: spark Updated Branches: refs/heads/branch-2.0 8159da20e -> 54001cb12 [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc ## What changes were proposed in this pull request? Issues with current reader behavior. - `text()` without args returns an empty DF with no columns -> inconsistent, its expected that text will always return a DF with `value` string field, - `textFile()` without args fails with exception because of the above reason, it expected the DF returned by `text()` to have a `value` field. - `orc()` does not have var args, inconsistent with others - `json(single-arg)` was removed, but that caused source compatibility issues - [SPARK-16009](https://issues.apache.org/jira/browse/SPARK-16009) - user specified schema was not respected when `text/csv/...` were used with no args - [SPARK-16007](https://issues.apache.org/jira/browse/SPARK-16007) The solution I am implementing is to do the following. - For each format, there will be a single argument method, and a vararg method. For json, parquet, csv, text, this means adding json(string), etc.. For orc, this means adding orc(varargs). - Remove the special handling of text(), csv(), etc. that returns empty dataframe with no fields. Rather pass on the empty sequence of paths to the datasource, and let each datasource handle it right. For e.g, text data source, should return empty DF with schema (value: string) - Deduped docs and fixed their formatting. ## How was this patch tested? Added new unit tests for Scala and Java tests Author: Tathagata Das Closes #13727 from tdas/SPARK-15982. (cherry picked from commit b99129cc452defc266f6d357f5baab5f4ff37a36) Signed-off-by: Shixiong Zhu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/54001cb1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/54001cb1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/54001cb1 Branch: refs/heads/branch-2.0 Commit: 54001cb129674be9f2459368fb608367f52371c2 Parents: 8159da2 Author: Tathagata Das Authored: Mon Jun 20 14:52:28 2016 -0700 Committer: Shixiong Zhu Committed: Mon Jun 20 14:52:35 2016 -0700 -- .../org/apache/spark/sql/DataFrameReader.scala | 132 + .../sql/JavaDataFrameReaderWriterSuite.java | 158 .../sql/test/DataFrameReaderWriterSuite.scala | 186 --- 3 files changed, 420 insertions(+), 56 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/54001cb1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index 2ae854d..841503b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -119,13 +119,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def load(): DataFrame = { -val dataSource = - DataSource( -sparkSession, -userSpecifiedSchema = userSpecifiedSchema, -className = source, -options = extraOptions.toMap) -Dataset.ofRows(sparkSession, LogicalRelation(dataSource.resolveRelation())) +load(Seq.empty: _*) // force invocation of `load(...varargs...)` } /** @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def load(path: String): DataFrame = { -option("path", path).load() +load(Seq(path): _*) // force invocation of 
`load(...varargs...)` } /** @@ -146,18 +140,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { */ @scala.annotation.varargs def load(paths: String*): DataFrame = { -if (paths.isEmpty) { - sparkSession.emptyDataFrame -} else { - sparkSession.baseRelationToDataFrame( -DataSource.apply( - sparkSession, - paths = paths, - userSpecifiedSchema = userSpecifiedSchema, - className = source, - options = extraOptions.toMap).resolveRelation()) -} +sparkSession.baseRelationToDataFrame( + DataSource.apply( +sparkSession, +paths = paths, +userSpecifiedSchema = userSpecifiedSchema, +className = source, +options = extraOptions.toMap).resolveRelation()) } + /** * Construct a [[DataFrame]] representing the database table accessible via JDBC URL * url named table and connection properties. @@ -247,11 +238,23 @@ class DataFrameReader private[sql](sparkSession: SparkSession) ex
spark git commit: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc
Repository: spark Updated Branches: refs/heads/master 6df8e3886 -> b99129cc4 [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc ## What changes were proposed in this pull request? Issues with current reader behavior. - `text()` without args returns an empty DF with no columns -> inconsistent, its expected that text will always return a DF with `value` string field, - `textFile()` without args fails with exception because of the above reason, it expected the DF returned by `text()` to have a `value` field. - `orc()` does not have var args, inconsistent with others - `json(single-arg)` was removed, but that caused source compatibility issues - [SPARK-16009](https://issues.apache.org/jira/browse/SPARK-16009) - user specified schema was not respected when `text/csv/...` were used with no args - [SPARK-16007](https://issues.apache.org/jira/browse/SPARK-16007) The solution I am implementing is to do the following. - For each format, there will be a single argument method, and a vararg method. For json, parquet, csv, text, this means adding json(string), etc.. For orc, this means adding orc(varargs). - Remove the special handling of text(), csv(), etc. that returns empty dataframe with no fields. Rather pass on the empty sequence of paths to the datasource, and let each datasource handle it right. For e.g, text data source, should return empty DF with schema (value: string) - Deduped docs and fixed their formatting. ## How was this patch tested? Added new unit tests for Scala and Java tests Author: Tathagata Das Closes #13727 from tdas/SPARK-15982. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b99129cc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b99129cc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b99129cc Branch: refs/heads/master Commit: b99129cc452defc266f6d357f5baab5f4ff37a36 Parents: 6df8e38 Author: Tathagata Das Authored: Mon Jun 20 14:52:28 2016 -0700 Committer: Shixiong Zhu Committed: Mon Jun 20 14:52:28 2016 -0700 -- .../org/apache/spark/sql/DataFrameReader.scala | 132 + .../sql/JavaDataFrameReaderWriterSuite.java | 158 .../sql/test/DataFrameReaderWriterSuite.scala | 186 --- 3 files changed, 420 insertions(+), 56 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b99129cc/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index 2ae854d..841503b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -119,13 +119,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def load(): DataFrame = { -val dataSource = - DataSource( -sparkSession, -userSpecifiedSchema = userSpecifiedSchema, -className = source, -options = extraOptions.toMap) -Dataset.ofRows(sparkSession, LogicalRelation(dataSource.resolveRelation())) +load(Seq.empty: _*) // force invocation of `load(...varargs...)` } /** @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def load(path: String): DataFrame = { -option("path", path).load() +load(Seq(path): _*) // force invocation of `load(...varargs...)` } /** @@ -146,18 +140,15 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging { */ @scala.annotation.varargs def load(paths: String*): DataFrame = { -if (paths.isEmpty) { - sparkSession.emptyDataFrame -} else { - sparkSession.baseRelationToDataFrame( -DataSource.apply( - sparkSession, - paths = paths, - userSpecifiedSchema = userSpecifiedSchema, - className = source, - options = extraOptions.toMap).resolveRelation()) -} +sparkSession.baseRelationToDataFrame( + DataSource.apply( +sparkSession, +paths = paths, +userSpecifiedSchema = userSpecifiedSchema, +className = source, +options = extraOptions.toMap).resolveRelation()) } + /** * Construct a [[DataFrame]] representing the database table accessible via JDBC URL * url named table and connection properties. @@ -247,11 +238,23 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { /** * Loads a JSON file (one object per line) and returns the result as a [[DataF
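The patch itself touches the Scala/Java `DataFrameReader`; for readers working from Python, a sketch of the intended behavior (paths are hypothetical): an explicit schema set through `schema()` is carried through to the source, and `text()` always yields a single string column named `value`.

```
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("reader-demo").getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])

# Hypothetical input path; the user-specified schema is used as-is.
people = spark.read.schema(schema).json("/tmp/people.json")
people.printSchema()

# text() reads each line into a DataFrame with one string column, `value`.
lines = spark.read.text("/tmp/people.json")
lines.printSchema()
```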
spark git commit: [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0
Repository: spark Updated Branches: refs/heads/branch-2.0 54aef1c14 -> 8159da20e [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0 ## What changes were proposed in this pull request? Initial SQL programming guide update for Spark 2.0. Contents like 1.6 to 2.0 migration guide are still incomplete. We may also want to add more examples for Scala/Java Dataset typed transformations. ## How was this patch tested? N/A Author: Cheng Lian Closes #13592 from liancheng/sql-programming-guide-2.0. (cherry picked from commit 6df8e3886063a9d8c2e8499456ea9166245d5640) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8159da20 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8159da20 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8159da20 Branch: refs/heads/branch-2.0 Commit: 8159da20ee9c170324772792f2b242a85cbb7d34 Parents: 54aef1c Author: Cheng Lian Authored: Mon Jun 20 14:50:28 2016 -0700 Committer: Yin Huai Committed: Mon Jun 20 14:50:46 2016 -0700 -- docs/sql-programming-guide.md | 605 +++-- 1 file changed, 317 insertions(+), 288 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8159da20/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index efdf873..d93f30b 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -12,130 +12,129 @@ title: Spark SQL and DataFrames Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result +interact with Spark SQL including SQL and the Dataset API. When computing a result the same execution engine is used, independent of which API/language you are using to express the -computation. This unification means that developers can easily switch back and forth between the -various APIs based on which provides the most natural way to express a given transformation. +computation. This unification means that developers can easily switch back and forth between +different APIs based on which provides the most natural way to express a given transformation. All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell. ## SQL -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. +One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames). +SQL from within another programming language the results will be returned as a [DataFrame](#datasets-and-dataframes). You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli) or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server). -## DataFrames +## Datasets and DataFrames -A DataFrame is a distributed collection of data organized into named columns. 
It is conceptually -equivalent to a table in a relational database or a data frame in R/Python, but with richer -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such -as: structured data files, tables in Hive, external databases, or existing RDDs. +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then +manipulated using functional transformations (`map`, `flatMap`, `filter`, etc.). -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame), -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html), -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html). +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s. +In fact, `DataFrame` is si
spark git commit: [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0
Repository: spark Updated Branches: refs/heads/master d0eddb80e -> 6df8e3886 [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0 ## What changes were proposed in this pull request? Initial SQL programming guide update for Spark 2.0. Contents like 1.6 to 2.0 migration guide are still incomplete. We may also want to add more examples for Scala/Java Dataset typed transformations. ## How was this patch tested? N/A Author: Cheng Lian Closes #13592 from liancheng/sql-programming-guide-2.0. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6df8e388 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6df8e388 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6df8e388 Branch: refs/heads/master Commit: 6df8e3886063a9d8c2e8499456ea9166245d5640 Parents: d0eddb8 Author: Cheng Lian Authored: Mon Jun 20 14:50:28 2016 -0700 Committer: Yin Huai Committed: Mon Jun 20 14:50:28 2016 -0700 -- docs/sql-programming-guide.md | 605 +++-- 1 file changed, 317 insertions(+), 288 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6df8e388/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index efdf873..d93f30b 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -12,130 +12,129 @@ title: Spark SQL and DataFrames Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result +interact with Spark SQL including SQL and the Dataset API. When computing a result the same execution engine is used, independent of which API/language you are using to express the -computation. This unification means that developers can easily switch back and forth between the -various APIs based on which provides the most natural way to express a given transformation. +computation. This unification means that developers can easily switch back and forth between +different APIs based on which provides the most natural way to express a given transformation. All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell. ## SQL -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. +One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames). +SQL from within another programming language the results will be returned as a [DataFrame](#datasets-and-dataframes). You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli) or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server). -## DataFrames +## Datasets and DataFrames -A DataFrame is a distributed collection of data organized into named columns. 
It is conceptually -equivalent to a table in a relational database or a data frame in R/Python, but with richer -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such -as: structured data files, tables in Hive, external databases, or existing RDDs. +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then +manipulated using functional transformations (`map`, `flatMap`, `filter`, etc.). -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame), -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html), -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html). +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s. +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets]. +However, [Java API][java-data
[1/2] spark git commit: [SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods
Repository: spark Updated Branches: refs/heads/branch-2.0 f90b2ea1d -> 54aef1c14 http://git-wip-us.apache.org/repos/asf/spark/blob/54aef1c1/R/pkg/R/mllib.R -- diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R index 2127dae..d6ff2aa 100644 --- a/R/pkg/R/mllib.R +++ b/R/pkg/R/mllib.R @@ -29,24 +29,28 @@ #' #' @param jobj a Java object reference to the backing Scala GeneralizedLinearRegressionWrapper #' @export +#' @note GeneralizedLinearRegressionModel since 2.0.0 setClass("GeneralizedLinearRegressionModel", representation(jobj = "jobj")) #' S4 class that represents a NaiveBayesModel #' #' @param jobj a Java object reference to the backing Scala NaiveBayesWrapper #' @export +#' @note NaiveBayesModel since 2.0.0 setClass("NaiveBayesModel", representation(jobj = "jobj")) #' S4 class that represents a AFTSurvivalRegressionModel #' #' @param jobj a Java object reference to the backing Scala AFTSurvivalRegressionWrapper #' @export +#' @note AFTSurvivalRegressionModel since 2.0.0 setClass("AFTSurvivalRegressionModel", representation(jobj = "jobj")) #' S4 class that represents a KMeansModel #' #' @param jobj a Java object reference to the backing Scala KMeansModel #' @export +#' @note KMeansModel since 2.0.0 setClass("KMeansModel", representation(jobj = "jobj")) #' Fits a generalized linear model @@ -73,6 +77,7 @@ setClass("KMeansModel", representation(jobj = "jobj")) #' model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family="gaussian") #' summary(model) #' } +#' @note spark.glm since 2.0.0 setMethod( "spark.glm", signature(data = "SparkDataFrame", formula = "formula"), @@ -120,6 +125,7 @@ setMethod( #' model <- glm(Sepal_Length ~ Sepal_Width, df, family="gaussian") #' summary(model) #' } +#' @note glm since 1.5.0 setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDataFrame"), function(formula, family = gaussian, data, epsilon = 1e-06, maxit = 25) { spark.glm(data, formula, family, epsilon, maxit) @@ -138,6 +144,7 @@ setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDat #' model <- glm(y ~ x, trainingData) #' summary(model) #' } +#' @note summary(GeneralizedLinearRegressionModel) since 2.0.0 setMethod("summary", signature(object = "GeneralizedLinearRegressionModel"), function(object, ...) { jobj <- object@jobj @@ -173,6 +180,7 @@ setMethod("summary", signature(object = "GeneralizedLinearRegressionModel"), #' @rdname print #' @name print.summary.GeneralizedLinearRegressionModel #' @export +#' @note print.summary.GeneralizedLinearRegressionModel since 2.0.0 print.summary.GeneralizedLinearRegressionModel <- function(x, ...) { if (x$is.loaded) { cat("\nSaved-loaded model does not support output 'Deviance Residuals'.\n") @@ -215,6 +223,7 @@ print.summary.GeneralizedLinearRegressionModel <- function(x, ...) 
{ #' predicted <- predict(model, testData) #' showDF(predicted) #' } +#' @note predict(GeneralizedLinearRegressionModel) since 1.5.0 setMethod("predict", signature(object = "GeneralizedLinearRegressionModel"), function(object, newData) { return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf))) @@ -236,6 +245,7 @@ setMethod("predict", signature(object = "GeneralizedLinearRegressionModel"), #' predicted <- predict(model, testData) #' showDF(predicted) #'} +#' @note predict(NaiveBayesModel) since 2.0.0 setMethod("predict", signature(object = "NaiveBayesModel"), function(object, newData) { return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf))) @@ -256,6 +266,7 @@ setMethod("predict", signature(object = "NaiveBayesModel"), #' model <- spark.naiveBayes(trainingData, y ~ x) #' summary(model) #'} +#' @note summary(NaiveBayesModel) since 2.0.0 setMethod("summary", signature(object = "NaiveBayesModel"), function(object, ...) { jobj <- object@jobj @@ -289,6 +300,7 @@ setMethod("summary", signature(object = "NaiveBayesModel"), #' \dontrun{ #' model <- spark.kmeans(data, ~ ., k=2, initMode="random") #' } +#' @note spark.kmeans since 2.0.0 setMethod("spark.kmeans", signature(data = "SparkDataFrame", formula = "formula"), function(data, formula, k, maxIter = 10, initMode = c("random", "k-means||")) { formula <- paste(deparse(formula), collapse = "") @@ -313,6 +325,7 @@ setMethod("spark.kmeans", signature(data = "SparkDataFrame", formula = "formula" #' fitted.model <- fitted(model) #' showDF(fitted.model) #'} +#' @note fitted since 2.0.0 setMethod("fitted", signature(object = "KMeansModel"), function(object, method = c("centers", "classes"), ...) { method <- match.arg(method) @@ -339,6 +352,7 @@ setMethod("fitted", signature(object = "KMeansModel"), #' model <- spark.kmeans(trainingData, ~ ., 2) #' summary(model) #' } +#'
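Putting the newly tagged mllib entry points together, a short sketch adapted from the roxygen `@examples` in this diff (the iris dataset is just a convenient choice; note SparkR converts dots in column names to underscores):

```r
sparkR.session()
df <- createDataFrame(iris)
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian")  # spark.glm, since 2.0.0
summary(model)                                                           # summary(GLM model), since 2.0.0
predicted <- predict(model, df)                                          # predict(GLM model), since 1.5.0
showDF(predicted)
```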
[1/2] spark git commit: [SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods
Repository: spark Updated Branches: refs/heads/master 92514232e -> d0eddb80e http://git-wip-us.apache.org/repos/asf/spark/blob/d0eddb80/R/pkg/R/mllib.R -- diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R index 2127dae..d6ff2aa 100644 --- a/R/pkg/R/mllib.R +++ b/R/pkg/R/mllib.R @@ -29,24 +29,28 @@ #' #' @param jobj a Java object reference to the backing Scala GeneralizedLinearRegressionWrapper #' @export +#' @note GeneralizedLinearRegressionModel since 2.0.0 setClass("GeneralizedLinearRegressionModel", representation(jobj = "jobj")) #' S4 class that represents a NaiveBayesModel #' #' @param jobj a Java object reference to the backing Scala NaiveBayesWrapper #' @export +#' @note NaiveBayesModel since 2.0.0 setClass("NaiveBayesModel", representation(jobj = "jobj")) #' S4 class that represents a AFTSurvivalRegressionModel #' #' @param jobj a Java object reference to the backing Scala AFTSurvivalRegressionWrapper #' @export +#' @note AFTSurvivalRegressionModel since 2.0.0 setClass("AFTSurvivalRegressionModel", representation(jobj = "jobj")) #' S4 class that represents a KMeansModel #' #' @param jobj a Java object reference to the backing Scala KMeansModel #' @export +#' @note KMeansModel since 2.0.0 setClass("KMeansModel", representation(jobj = "jobj")) #' Fits a generalized linear model @@ -73,6 +77,7 @@ setClass("KMeansModel", representation(jobj = "jobj")) #' model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family="gaussian") #' summary(model) #' } +#' @note spark.glm since 2.0.0 setMethod( "spark.glm", signature(data = "SparkDataFrame", formula = "formula"), @@ -120,6 +125,7 @@ setMethod( #' model <- glm(Sepal_Length ~ Sepal_Width, df, family="gaussian") #' summary(model) #' } +#' @note glm since 1.5.0 setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDataFrame"), function(formula, family = gaussian, data, epsilon = 1e-06, maxit = 25) { spark.glm(data, formula, family, epsilon, maxit) @@ -138,6 +144,7 @@ setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDat #' model <- glm(y ~ x, trainingData) #' summary(model) #' } +#' @note summary(GeneralizedLinearRegressionModel) since 2.0.0 setMethod("summary", signature(object = "GeneralizedLinearRegressionModel"), function(object, ...) { jobj <- object@jobj @@ -173,6 +180,7 @@ setMethod("summary", signature(object = "GeneralizedLinearRegressionModel"), #' @rdname print #' @name print.summary.GeneralizedLinearRegressionModel #' @export +#' @note print.summary.GeneralizedLinearRegressionModel since 2.0.0 print.summary.GeneralizedLinearRegressionModel <- function(x, ...) { if (x$is.loaded) { cat("\nSaved-loaded model does not support output 'Deviance Residuals'.\n") @@ -215,6 +223,7 @@ print.summary.GeneralizedLinearRegressionModel <- function(x, ...) 
{ #' predicted <- predict(model, testData) #' showDF(predicted) #' } +#' @note predict(GeneralizedLinearRegressionModel) since 1.5.0 setMethod("predict", signature(object = "GeneralizedLinearRegressionModel"), function(object, newData) { return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf))) @@ -236,6 +245,7 @@ setMethod("predict", signature(object = "GeneralizedLinearRegressionModel"), #' predicted <- predict(model, testData) #' showDF(predicted) #'} +#' @note predict(NaiveBayesModel) since 2.0.0 setMethod("predict", signature(object = "NaiveBayesModel"), function(object, newData) { return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf))) @@ -256,6 +266,7 @@ setMethod("predict", signature(object = "NaiveBayesModel"), #' model <- spark.naiveBayes(trainingData, y ~ x) #' summary(model) #'} +#' @note summary(NaiveBayesModel) since 2.0.0 setMethod("summary", signature(object = "NaiveBayesModel"), function(object, ...) { jobj <- object@jobj @@ -289,6 +300,7 @@ setMethod("summary", signature(object = "NaiveBayesModel"), #' \dontrun{ #' model <- spark.kmeans(data, ~ ., k=2, initMode="random") #' } +#' @note spark.kmeans since 2.0.0 setMethod("spark.kmeans", signature(data = "SparkDataFrame", formula = "formula"), function(data, formula, k, maxIter = 10, initMode = c("random", "k-means||")) { formula <- paste(deparse(formula), collapse = "") @@ -313,6 +325,7 @@ setMethod("spark.kmeans", signature(data = "SparkDataFrame", formula = "formula" #' fitted.model <- fitted(model) #' showDF(fitted.model) #'} +#' @note fitted since 2.0.0 setMethod("fitted", signature(object = "KMeansModel"), function(object, method = c("centers", "classes"), ...) { method <- match.arg(method) @@ -339,6 +352,7 @@ setMethod("fitted", signature(object = "KMeansModel"), #' model <- spark.kmeans(trainingData, ~ ., 2) #' summary(model) #' } +#' @not
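The k-means entry points tagged in the same file follow the same pattern; a sketch with illustrative parameters (columns and `k` are our choice, not from the patch):

```r
sparkR.session()
df <- createDataFrame(iris)
model <- spark.kmeans(df, ~ Sepal_Length + Sepal_Width, k = 3)  # spark.kmeans, since 2.0.0
summary(model)
centers <- fitted(model, "centers")                             # fitted, since 2.0.0
showDF(centers)
```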
[2/2] spark git commit: [SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods
[SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods ## What changes were proposed in this pull request? This PR adds `since` tags to Roxygen documentation according to the previous documentation archive. https://home.apache.org/~dongjoon/spark-2.0.0-docs/api/R/ ## How was this patch tested? Manual. Author: Dongjoon Hyun Closes #13734 from dongjoon-hyun/SPARK-14995. (cherry picked from commit d0eddb80eca04e4f5f8af3b5143096cf67200277) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/54aef1c1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/54aef1c1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/54aef1c1 Branch: refs/heads/branch-2.0 Commit: 54aef1c1414589b5143ec3cbbf3b1e17648b7067 Parents: f90b2ea Author: Dongjoon Hyun Authored: Mon Jun 20 14:24:41 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 14:24:48 2016 -0700 -- R/pkg/R/DataFrame.R | 93 +++- R/pkg/R/SQLContext.R | 42 ++--- R/pkg/R/WindowSpec.R | 8 +++ R/pkg/R/column.R | 10 +++ R/pkg/R/context.R| 3 +- R/pkg/R/functions.R | 153 ++ R/pkg/R/group.R | 6 ++ R/pkg/R/jobj.R | 1 + R/pkg/R/mllib.R | 24 R/pkg/R/schema.R | 5 +- R/pkg/R/sparkR.R | 18 +++--- R/pkg/R/stats.R | 6 ++ R/pkg/R/utils.R | 1 + R/pkg/R/window.R | 4 ++ 14 files changed, 340 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/54aef1c1/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 583d3ae..ecdcd6e 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -25,7 +25,7 @@ setOldClass("structType") #' S4 class that represents a SparkDataFrame #' -#' DataFrames can be created using functions like \link{createDataFrame}, +#' SparkDataFrames can be created using functions like \link{createDataFrame}, #' \link{read.json}, \link{table} etc. 
#' #' @family SparkDataFrame functions @@ -42,6 +42,7 @@ setOldClass("structType") #' sparkR.session() #' df <- createDataFrame(faithful) #'} +#' @note SparkDataFrame since 2.0.0 setClass("SparkDataFrame", slots = list(env = "environment", sdf = "jobj")) @@ -81,6 +82,7 @@ dataFrame <- function(sdf, isCached = FALSE) { #' df <- read.json(path) #' printSchema(df) #'} +#' @note printSchema since 1.4.0 setMethod("printSchema", signature(x = "SparkDataFrame"), function(x) { @@ -105,6 +107,7 @@ setMethod("printSchema", #' df <- read.json(path) #' dfSchema <- schema(df) #'} +#' @note schema since 1.4.0 setMethod("schema", signature(x = "SparkDataFrame"), function(x) { @@ -128,6 +131,7 @@ setMethod("schema", #' df <- read.json(path) #' explain(df, TRUE) #'} +#' @note explain since 1.4.0 setMethod("explain", signature(x = "SparkDataFrame"), function(x, extended = FALSE) { @@ -158,6 +162,7 @@ setMethod("explain", #' df <- read.json(path) #' isLocal(df) #'} +#' @note isLocal since 1.4.0 setMethod("isLocal", signature(x = "SparkDataFrame"), function(x) { @@ -182,6 +187,7 @@ setMethod("isLocal", #' df <- read.json(path) #' showDF(df) #'} +#' @note showDF since 1.4.0 setMethod("showDF", signature(x = "SparkDataFrame"), function(x, numRows = 20, truncate = TRUE) { @@ -206,6 +212,7 @@ setMethod("showDF", #' df <- read.json(path) #' df #'} +#' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { cols <- lapply(dtypes(object), function(l) { @@ -232,6 +239,7 @@ setMethod("show", "SparkDataFrame", #' df <- read.json(path) #' dtypes(df) #'} +#' @note dtypes since 1.4.0 setMethod("dtypes", signature(x = "SparkDataFrame"), function(x) { @@ -259,6 +267,7 @@ setMethod("dtypes", #' columns(df) #' colnames(df) #'} +#' @note columns since 1.4.0 setMethod("columns", signature(x = "SparkDataFrame"), function(x) { @@ -269,6 +278,7 @@ setMethod("columns", #' @rdname columns #' @name names +#' @note names since 1.5.0 setMethod("names", signature(x = "SparkDataFrame"), function(x) { @@ -277,6 +287,7 @@ setMethod("names", #' @rdname columns #' @name names<- +#' @note names<- since 1.5.0 setMethod("names<-", signature(x = "SparkDataFrame"), function(x, value) { @@ -288,6 +299,7 @@ setMethod("names<-", #' @rdname columns #' @name colnames +#' @note colnames since 1.6.0 setMethod("colnames", signature(x = "SparkDataFrame")
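A quick tour of the schema-inspection methods annotated above (all tagged "since 1.4.0"); a hedged sketch using a locally created SparkDataFrame rather than the JSON path placeholder in the roxygen examples:

```r
sparkR.session()
df <- createDataFrame(faithful)
printSchema(df)    # tree-formatted schema
dtypes(df)         # list of (name, type) pairs
columns(df)        # "eruptions" "waiting"
explain(df, TRUE)  # extended query plan
```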
[2/2] spark git commit: [SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods
[SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods ## What changes were proposed in this pull request? This PR adds `since` tags to Roxygen documentation according to the previous documentation archive. https://home.apache.org/~dongjoon/spark-2.0.0-docs/api/R/ ## How was this patch tested? Manual. Author: Dongjoon Hyun Closes #13734 from dongjoon-hyun/SPARK-14995. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d0eddb80 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d0eddb80 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d0eddb80 Branch: refs/heads/master Commit: d0eddb80eca04e4f5f8af3b5143096cf67200277 Parents: 9251423 Author: Dongjoon Hyun Authored: Mon Jun 20 14:24:41 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 14:24:41 2016 -0700 -- R/pkg/R/DataFrame.R | 93 +++- R/pkg/R/SQLContext.R | 42 ++--- R/pkg/R/WindowSpec.R | 8 +++ R/pkg/R/column.R | 10 +++ R/pkg/R/context.R| 3 +- R/pkg/R/functions.R | 153 ++ R/pkg/R/group.R | 6 ++ R/pkg/R/jobj.R | 1 + R/pkg/R/mllib.R | 24 R/pkg/R/schema.R | 5 +- R/pkg/R/sparkR.R | 18 +++--- R/pkg/R/stats.R | 6 ++ R/pkg/R/utils.R | 1 + R/pkg/R/window.R | 4 ++ 14 files changed, 340 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d0eddb80/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 583d3ae..ecdcd6e 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -25,7 +25,7 @@ setOldClass("structType") #' S4 class that represents a SparkDataFrame #' -#' DataFrames can be created using functions like \link{createDataFrame}, +#' SparkDataFrames can be created using functions like \link{createDataFrame}, #' \link{read.json}, \link{table} etc. 
#' #' @family SparkDataFrame functions @@ -42,6 +42,7 @@ setOldClass("structType") #' sparkR.session() #' df <- createDataFrame(faithful) #'} +#' @note SparkDataFrame since 2.0.0 setClass("SparkDataFrame", slots = list(env = "environment", sdf = "jobj")) @@ -81,6 +82,7 @@ dataFrame <- function(sdf, isCached = FALSE) { #' df <- read.json(path) #' printSchema(df) #'} +#' @note printSchema since 1.4.0 setMethod("printSchema", signature(x = "SparkDataFrame"), function(x) { @@ -105,6 +107,7 @@ setMethod("printSchema", #' df <- read.json(path) #' dfSchema <- schema(df) #'} +#' @note schema since 1.4.0 setMethod("schema", signature(x = "SparkDataFrame"), function(x) { @@ -128,6 +131,7 @@ setMethod("schema", #' df <- read.json(path) #' explain(df, TRUE) #'} +#' @note explain since 1.4.0 setMethod("explain", signature(x = "SparkDataFrame"), function(x, extended = FALSE) { @@ -158,6 +162,7 @@ setMethod("explain", #' df <- read.json(path) #' isLocal(df) #'} +#' @note isLocal since 1.4.0 setMethod("isLocal", signature(x = "SparkDataFrame"), function(x) { @@ -182,6 +187,7 @@ setMethod("isLocal", #' df <- read.json(path) #' showDF(df) #'} +#' @note showDF since 1.4.0 setMethod("showDF", signature(x = "SparkDataFrame"), function(x, numRows = 20, truncate = TRUE) { @@ -206,6 +212,7 @@ setMethod("showDF", #' df <- read.json(path) #' df #'} +#' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { cols <- lapply(dtypes(object), function(l) { @@ -232,6 +239,7 @@ setMethod("show", "SparkDataFrame", #' df <- read.json(path) #' dtypes(df) #'} +#' @note dtypes since 1.4.0 setMethod("dtypes", signature(x = "SparkDataFrame"), function(x) { @@ -259,6 +267,7 @@ setMethod("dtypes", #' columns(df) #' colnames(df) #'} +#' @note columns since 1.4.0 setMethod("columns", signature(x = "SparkDataFrame"), function(x) { @@ -269,6 +278,7 @@ setMethod("columns", #' @rdname columns #' @name names +#' @note names since 1.5.0 setMethod("names", signature(x = "SparkDataFrame"), function(x) { @@ -277,6 +287,7 @@ setMethod("names", #' @rdname columns #' @name names<- +#' @note names<- since 1.5.0 setMethod("names<-", signature(x = "SparkDataFrame"), function(x, value) { @@ -288,6 +299,7 @@ setMethod("names<-", #' @rdname columns #' @name colnames +#' @note colnames since 1.6.0 setMethod("colnames", signature(x = "SparkDataFrame"), function(x) { @@ -296,6 +308,7 @@ setMethod("colnames", #' @rdname columns #' @name colnames<-
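The base-R-style accessors documented under the same `@rdname columns` behave like their data.frame counterparts; a small sketch (dataset is illustrative):

```r
df <- createDataFrame(faithful)
names(df)                           # "eruptions" "waiting"   (names, since 1.5.0)
names(df) <- c("duration", "wait")  # names<-, since 1.5.0
colnames(df)                        # "duration" "wait"        (colnames, since 1.6.0)
```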
spark git commit: [MINOR] Closing stale pull requests.
Repository: spark Updated Branches: refs/heads/master 359c2e827 -> 92514232e [MINOR] Closing stale pull requests. Closes #13114 Closes #10187 Closes #13432 Closes #13550 Author: Sean Owen Closes #13781 from srowen/CloseStalePR. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/92514232 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/92514232 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/92514232 Branch: refs/heads/master Commit: 92514232e52af0f5f0413ed97b9571b1b9daaa90 Parents: 359c2e8 Author: Sean Owen Authored: Mon Jun 20 22:12:55 2016 +0100 Committer: Sean Owen Committed: Mon Jun 20 22:12:55 2016 +0100 -- -- - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates
Repository: spark Updated Branches: refs/heads/branch-2.0 45c41aa33 -> f90b2ea1d [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates ## What changes were proposed in this pull request? roxygen2 doc, programming guide, example updates ## How was this patch tested? manual checks shivaram Author: Felix Cheung Closes #13751 from felixcheung/rsparksessiondoc. (cherry picked from commit 359c2e827d5682249c009e83379a5ee8e5aa4e89) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f90b2ea1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f90b2ea1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f90b2ea1 Branch: refs/heads/branch-2.0 Commit: f90b2ea1d96bba4650b8d1ce37a60c81c89bca96 Parents: 45c41aa Author: Felix Cheung Authored: Mon Jun 20 13:46:24 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 13:46:32 2016 -0700 -- R/pkg/R/DataFrame.R | 169 +-- R/pkg/R/SQLContext.R| 47 +++- R/pkg/R/mllib.R | 6 +- R/pkg/R/schema.R| 24 ++-- R/pkg/R/sparkR.R| 7 +- docs/sparkr.md | 99 examples/src/main/r/data-manipulation.R | 15 +-- examples/src/main/r/dataframe.R | 13 +-- examples/src/main/r/ml.R| 21 ++-- 9 files changed, 162 insertions(+), 239 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f90b2ea1/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index f3a3eff..583d3ae 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -35,12 +35,11 @@ setOldClass("structType") #' @slot env An R environment that stores bookkeeping states of the SparkDataFrame #' @slot sdf A Java object reference to the backing Scala DataFrame #' @seealso \link{createDataFrame}, \link{read.json}, \link{table} -#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes} +#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkdataframe} #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' df <- createDataFrame(faithful) #'} setClass("SparkDataFrame", @@ -77,8 +76,7 @@ dataFrame <- function(sdf, isCached = FALSE) { #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' printSchema(df) @@ -102,8 +100,7 @@ setMethod("printSchema", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' dfSchema <- schema(df) @@ -126,8 +123,7 @@ setMethod("schema", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' explain(df, TRUE) @@ -157,8 +153,7 @@ setMethod("explain", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' isLocal(df) @@ -182,8 +177,7 @@ setMethod("isLocal", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' showDF(df) @@ -207,8 +201,7 @@ setMethod("showDF", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- 
read.json(path) #' df @@ -234,8 +227,7 @@ setMethod("show", "SparkDataFrame", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' dtypes(df) @@ -261,8 +253,7 @@ setMethod("dtypes", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' columns(df) @@ -396,8 +387,7 @@ setMethod("coltypes", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' coltypes(df) <- c("character", "integer") @@ -432,7 +422,7 @@ setMethod("coltypes<-", #' Creates a temporary view using the given name. #' -#' Creates a new temporary view using a SparkDataFrame in the SQLContex
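The before/after entry points this doc pass standardizes on, as a sketch (sample data is our choice):

```r
# old (pre-2.0, now deprecated in the examples):
#   sc <- sparkR.init(); sqlContext <- sparkRSQL.init(sc)
# new, as the updated roxygen examples show:
sparkR.session()
df <- createDataFrame(faithful)
head(df)
```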
spark git commit: [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates
Repository: spark Updated Branches: refs/heads/master b0f2fb5b9 -> 359c2e827 [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates ## What changes were proposed in this pull request? roxygen2 doc, programming guide, example updates ## How was this patch tested? manual checks shivaram Author: Felix Cheung Closes #13751 from felixcheung/rsparksessiondoc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/359c2e82 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/359c2e82 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/359c2e82 Branch: refs/heads/master Commit: 359c2e827d5682249c009e83379a5ee8e5aa4e89 Parents: b0f2fb5 Author: Felix Cheung Authored: Mon Jun 20 13:46:24 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 13:46:24 2016 -0700 -- R/pkg/R/DataFrame.R | 169 +-- R/pkg/R/SQLContext.R| 47 +++- R/pkg/R/mllib.R | 6 +- R/pkg/R/schema.R| 24 ++-- R/pkg/R/sparkR.R| 7 +- docs/sparkr.md | 99 examples/src/main/r/data-manipulation.R | 15 +-- examples/src/main/r/dataframe.R | 13 +-- examples/src/main/r/ml.R| 21 ++-- 9 files changed, 162 insertions(+), 239 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/359c2e82/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index f3a3eff..583d3ae 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -35,12 +35,11 @@ setOldClass("structType") #' @slot env An R environment that stores bookkeeping states of the SparkDataFrame #' @slot sdf A Java object reference to the backing Scala DataFrame #' @seealso \link{createDataFrame}, \link{read.json}, \link{table} -#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes} +#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkdataframe} #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' df <- createDataFrame(faithful) #'} setClass("SparkDataFrame", @@ -77,8 +76,7 @@ dataFrame <- function(sdf, isCached = FALSE) { #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' printSchema(df) @@ -102,8 +100,7 @@ setMethod("printSchema", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' dfSchema <- schema(df) @@ -126,8 +123,7 @@ setMethod("schema", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' explain(df, TRUE) @@ -157,8 +153,7 @@ setMethod("explain", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' isLocal(df) @@ -182,8 +177,7 @@ setMethod("isLocal", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' showDF(df) @@ -207,8 +201,7 @@ setMethod("showDF", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' df @@ -234,8 +227,7 @@ setMethod("show", "SparkDataFrame", #' @export #' @examples #'\dontrun{ 
-#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' dtypes(df) @@ -261,8 +253,7 @@ setMethod("dtypes", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' columns(df) @@ -396,8 +387,7 @@ setMethod("coltypes", #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' sqlContext <- sparkRSQL.init(sc) +#' sparkR.session() #' path <- "path/to/file.json" #' df <- read.json(path) #' coltypes(df) <- c("character", "integer") @@ -432,7 +422,7 @@ setMethod("coltypes<-", #' Creates a temporary view using the given name. #' -#' Creates a new temporary view using a SparkDataFrame in the SQLContext. If a +#' Creates a new temporary view using a SparkDataFrame in the Spark Session. If a #' temporary view with
spark git commit: [SPARK-16053][R] Add `spark_partition_id` in SparkR
Repository: spark Updated Branches: refs/heads/branch-2.0 dfa920204 -> 45c41aa33 [SPARK-16053][R] Add `spark_partition_id` in SparkR ## What changes were proposed in this pull request? This PR adds `spark_partition_id` virtual column function in SparkR for API parity. The following is just an example to illustrate a SparkR usage on a partitioned parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`. ```r > collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id( id SPARK_PARTITION_ID() 1 30 2 40 3 81 4 91 5 02 6 13 7 24 8 55 9 66 10 77 ``` ## How was this patch tested? Pass the Jenkins tests (including new testcase). Author: Dongjoon Hyun Closes #13768 from dongjoon-hyun/SPARK-16053. (cherry picked from commit b0f2fb5b9729b38744bf784f2072f5ee52314f87) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/45c41aa3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/45c41aa3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/45c41aa3 Branch: refs/heads/branch-2.0 Commit: 45c41aa33b39bfc38b8615fde044356a590edcfb Parents: dfa9202 Author: Dongjoon Hyun Authored: Mon Jun 20 13:41:03 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 13:41:11 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 21 + R/pkg/R/generics.R| 4 R/pkg/inst/tests/testthat/test_sparkSQL.R | 1 + 4 files changed, 27 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/45c41aa3/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index aaeab66..45663f4 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -260,6 +260,7 @@ exportMethods("%in%", "skewness", "sort_array", "soundex", + "spark_partition_id", "stddev", "stddev_pop", "stddev_samp", http://git-wip-us.apache.org/repos/asf/spark/blob/45c41aa3/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 0fb38bc..c26f963 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -1206,6 +1206,27 @@ setMethod("soundex", column(jc) }) +#' Return the partition ID as a column +#' +#' Return the partition ID of the Spark task as a SparkDataFrame column. +#' Note that this is nondeterministic because it depends on data partitioning and +#' task scheduling. +#' +#' This is equivalent to the SPARK_PARTITION_ID function in SQL. 
+#' +#' @rdname spark_partition_id +#' @name spark_partition_id +#' @export +#' @examples +#' \dontrun{select(df, spark_partition_id())} +#' @note spark_partition_id since 2.0.0 +setMethod("spark_partition_id", + signature(x = "missing"), + function() { +jc <- callJStatic("org.apache.spark.sql.functions", "spark_partition_id") +column(jc) + }) + #' @rdname sd #' @name stddev setMethod("stddev", http://git-wip-us.apache.org/repos/asf/spark/blob/45c41aa3/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index dcc1cf2..f6b9276 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -1135,6 +1135,10 @@ setGeneric("sort_array", function(x, asc = TRUE) { standardGeneric("sort_array") #' @export setGeneric("soundex", function(x) { standardGeneric("soundex") }) +#' @rdname spark_partition_id +#' @export +setGeneric("spark_partition_id", function(x) { standardGeneric("spark_partition_id") }) + #' @rdname sd #' @export setGeneric("stddev", function(x) { standardGeneric("stddev") }) http://git-wip-us.apache.org/repos/asf/spark/blob/45c41aa3/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index 114fec6..d53c40d 100644 --- a/R/pkg/inst/tests/testthat/test_sparkSQL.R +++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R @@ -1059,6 +1059,7 @@ test_that("column functions", { c16 <- is.nan(c) + isnan(c) + isNaN(c) c17 <- cov(c, c1) + cov("c", "c1") + covar_samp(c, c1) + covar_samp("c", "c1") c18 <- covar_pop(c, c1) + covar_pop("c", "c1") + c19 <- spark_partition_id() # Test if base::is.nan()
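One way to inspect how rows are spread across partitions from R, loosely following the PR description's example; the alias name and partition count are arbitrary, and the result is nondeterministic across runs, as the doc notes:

```r
df <- repartition(createDataFrame(faithful), 3L)
pids <- collect(select(df, alias(spark_partition_id(), "pid")))
table(pids$pid)    # row count per partition id
```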
spark git commit: [SPARK-16053][R] Add `spark_partition_id` in SparkR
Repository: spark Updated Branches: refs/heads/master aee1420ec -> b0f2fb5b9 [SPARK-16053][R] Add `spark_partition_id` in SparkR ## What changes were proposed in this pull request? This PR adds `spark_partition_id` virtual column function in SparkR for API parity. The following is just an example to illustrate a SparkR usage on a partitioned parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`. ```r > collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id( id SPARK_PARTITION_ID() 1 30 2 40 3 81 4 91 5 02 6 13 7 24 8 55 9 66 10 77 ``` ## How was this patch tested? Pass the Jenkins tests (including new testcase). Author: Dongjoon Hyun Closes #13768 from dongjoon-hyun/SPARK-16053. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b0f2fb5b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b0f2fb5b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b0f2fb5b Branch: refs/heads/master Commit: b0f2fb5b9729b38744bf784f2072f5ee52314f87 Parents: aee1420 Author: Dongjoon Hyun Authored: Mon Jun 20 13:41:03 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 13:41:03 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 21 + R/pkg/R/generics.R| 4 R/pkg/inst/tests/testthat/test_sparkSQL.R | 1 + 4 files changed, 27 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b0f2fb5b/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index aaeab66..45663f4 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -260,6 +260,7 @@ exportMethods("%in%", "skewness", "sort_array", "soundex", + "spark_partition_id", "stddev", "stddev_pop", "stddev_samp", http://git-wip-us.apache.org/repos/asf/spark/blob/b0f2fb5b/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 0fb38bc..c26f963 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -1206,6 +1206,27 @@ setMethod("soundex", column(jc) }) +#' Return the partition ID as a column +#' +#' Return the partition ID of the Spark task as a SparkDataFrame column. +#' Note that this is nondeterministic because it depends on data partitioning and +#' task scheduling. +#' +#' This is equivalent to the SPARK_PARTITION_ID function in SQL. 
+#' +#' @rdname spark_partition_id +#' @name spark_partition_id +#' @export +#' @examples +#' \dontrun{select(df, spark_partition_id())} +#' @note spark_partition_id since 2.0.0 +setMethod("spark_partition_id", + signature(x = "missing"), + function() { +jc <- callJStatic("org.apache.spark.sql.functions", "spark_partition_id") +column(jc) + }) + #' @rdname sd #' @name stddev setMethod("stddev", http://git-wip-us.apache.org/repos/asf/spark/blob/b0f2fb5b/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index dcc1cf2..f6b9276 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -1135,6 +1135,10 @@ setGeneric("sort_array", function(x, asc = TRUE) { standardGeneric("sort_array") #' @export setGeneric("soundex", function(x) { standardGeneric("soundex") }) +#' @rdname spark_partition_id +#' @export +setGeneric("spark_partition_id", function(x) { standardGeneric("spark_partition_id") }) + #' @rdname sd #' @export setGeneric("stddev", function(x) { standardGeneric("stddev") }) http://git-wip-us.apache.org/repos/asf/spark/blob/b0f2fb5b/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index 114fec6..d53c40d 100644 --- a/R/pkg/inst/tests/testthat/test_sparkSQL.R +++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R @@ -1059,6 +1059,7 @@ test_that("column functions", { c16 <- is.nan(c) + isnan(c) + isNaN(c) c17 <- cov(c, c1) + cov("c", "c1") + covar_samp(c, c1) + covar_samp("c", "c1") c18 <- covar_pop(c, c1) + covar_pop("c", "c1") + c19 <- spark_partition_id() # Test if base::is.nan() is exposed expect_equal(is.nan(c("a", "b")), c(FALSE, FALSE))
spark git commit: [SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time
Repository: spark Updated Branches: refs/heads/branch-1.6 16b7f1dfc -> db86e7fd2 [SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time Internally, we use Int to represent a date (the days since 1970-01-01), when we convert that into unix timestamp (milli-seconds since epoch in UTC), we get the offset of a timezone using local millis (the milli-seconds since 1970-01-01 in a timezone), but TimeZone.getOffset() expect unix timestamp, the result could be off by one hour (in Daylight Saving Time (DST) or not). This PR change to use best effort approximate of posix timestamp to lookup the offset. In the event of changing of DST, Some time is not defined (for example, 2016-03-13 02:00:00 PST), or could lead to multiple valid result in UTC (for example, 2016-11-06 01:00:00), this best effort approximate should be enough in practice. Added regression tests. Author: Davies Liu Closes #13652 from davies/fix_timezone. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/db86e7fd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/db86e7fd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/db86e7fd Branch: refs/heads/branch-1.6 Commit: db86e7fd263ca4e24cf8faad95fca3189bab2fb0 Parents: 16b7f1d Author: Davies Liu Authored: Sun Jun 19 00:34:52 2016 -0700 Committer: Davies Liu Committed: Tue Jun 21 04:38:16 2016 +0800 -- .../spark/sql/catalyst/util/DateTimeUtils.scala | 50 ++-- .../org/apache/spark/sql/types/DateType.scala | 2 +- .../sql/catalyst/util/DateTimeTestUtils.scala | 40 .../sql/catalyst/util/DateTimeUtilsSuite.scala | 40 4 files changed, 128 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/db86e7fd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala index 2b93882..157ac2b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala @@ -89,8 +89,8 @@ object DateTimeUtils { // reverse of millisToDays def daysToMillis(days: SQLDate): Long = { -val millisUtc = days.toLong * MILLIS_PER_DAY -millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc) +val millisLocal = days.toLong * MILLIS_PER_DAY +millisLocal - getOffsetFromLocalMillis(millisLocal, threadLocalLocalTimeZone.get()) } def dateToString(days: SQLDate): String = @@ -820,6 +820,41 @@ object DateTimeUtils { } /** + * Lookup the offset for given millis seconds since 1970-01-01 00:00:00 in given timezone. 
+ */ + private def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): Long = { +var guess = tz.getRawOffset +// the actual offset should be calculated based on milliseconds in UTC +val offset = tz.getOffset(millisLocal - guess) +if (offset != guess) { + guess = tz.getOffset(millisLocal - offset) + if (guess != offset) { +// fallback to do the reverse lookup using java.sql.Timestamp +// this should only happen near the start or end of DST +val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt +val year = getYear(days) +val month = getMonth(days) +val day = getDayOfMonth(days) + +var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt +if (millisOfDay < 0) { + millisOfDay += MILLIS_PER_DAY.toInt +} +val seconds = (millisOfDay / 1000L).toInt +val hh = seconds / 3600 +val mm = seconds / 60 % 60 +val ss = seconds % 60 +val nano = millisOfDay % 1000 * 100 + +// create a Timestamp to get the unix timestamp (in UTC) +val timestamp = new Timestamp(year - 1900, month - 1, day, hh, mm, ss, nano) +guess = (millisLocal - timestamp.getTime).toInt + } +} +guess + } + + /** * Returns a timestamp of given timezone from utc timestamp, with the same string * representation in their timezone. */ @@ -835,7 +870,16 @@ object DateTimeUtils { */ def toUTCTime(time: SQLTimestamp, timeZone: String): SQLTimestamp = { val tz = TimeZone.getTimeZone(timeZone) -val offset = tz.getOffset(time / 1000L) +val offset = getOffsetFromLocalMillis(time / 1000L, tz) time - offset * 1000L } + + /** + * Re-initialize the current thread's thread locals. Exposed for testing. + */ + private[util] def resetThreadLocals(): Unit = { +th
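The fix, in words: treat the local millis only as an approximation of the UTC instant, look up the offset there, and re-check once if the two disagree. A loose R illustration of that idea (not the Scala patch, and without the java.sql.Timestamp fallback for the ambiguous DST window; `$gmtoff` availability is platform-dependent):

```r
# offset_at(): the zone's UTC offset, in seconds, at a given UTC instant.
offset_at <- function(unix_secs, tz) {
  as.POSIXlt(as.POSIXct(unix_secs, origin = "1970-01-01", tz = "UTC"), tz = tz)$gmtoff
}

# offset_from_local_secs(): offset for "local seconds since epoch", using the
# fixed-point style lookup described in the commit message.
offset_from_local_secs <- function(local_secs, tz) {
  guess <- offset_at(local_secs, tz)            # first approximation
  offset <- offset_at(local_secs - guess, tz)   # re-evaluate at the approximated UTC instant
  if (offset != guess) offset_at(local_secs - offset, tz) else offset
}

local_secs <- as.integer(as.Date("2016-03-13") - as.Date("1970-01-01")) * 86400
offset_from_local_secs(local_secs, "America/Los_Angeles") / 3600   # -8 (PST at local midnight)
```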
spark git commit: [SPARKR] fix R roxygen2 doc for count on GroupedData
Repository: spark Updated Branches: refs/heads/branch-2.0 d2c94e6a4 -> dfa920204 [SPARKR] fix R roxygen2 doc for count on GroupedData ## What changes were proposed in this pull request? fix code doc ## How was this patch tested? manual shivaram Author: Felix Cheung Closes #13782 from felixcheung/rcountdoc. (cherry picked from commit aee1420eca64dfc145f31b8c653388fafc5ccd8f) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dfa92020 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dfa92020 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dfa92020 Branch: refs/heads/branch-2.0 Commit: dfa920204e3407c38df9012ca42b7b56c416a5b3 Parents: d2c94e6 Author: Felix Cheung Authored: Mon Jun 20 12:31:00 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 12:31:08 2016 -0700 -- R/pkg/R/group.R | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dfa92020/R/pkg/R/group.R -- diff --git a/R/pkg/R/group.R b/R/pkg/R/group.R index eba083f..65b9e84 100644 --- a/R/pkg/R/group.R +++ b/R/pkg/R/group.R @@ -58,7 +58,7 @@ setMethod("show", "GroupedData", #' #' @param x a GroupedData #' @return a SparkDataFrame -#' @rdname agg +#' @rdname count #' @export #' @examples #' \dontrun{ @@ -83,6 +83,7 @@ setMethod("count", #' @rdname summarize #' @name agg #' @family agg_funcs +#' @export #' @examples #' \dontrun{ #' df2 <- agg(df, age = "sum") # new column name will be created as 'SUM(age#0)' @@ -160,6 +161,7 @@ createMethods() #' @return a SparkDataFrame #' @rdname gapply #' @name gapply +#' @export #' @examples #' \dontrun{ #' Computes the arithmetic mean of the second column by grouping - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
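What the regrouped docs describe, in use (a hedged sketch; dataset and column are illustrative):

```r
df <- createDataFrame(faithful)
head(count(groupBy(df, "waiting")))                     # count on GroupedData, now under ?count
head(agg(groupBy(df, "waiting"), eruptions = "mean"))   # agg stays documented under ?summarize
```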
spark git commit: [SPARKR] fix R roxygen2 doc for count on GroupedData
Repository: spark Updated Branches: refs/heads/master 46d98e0a1 -> aee1420ec [SPARKR] fix R roxygen2 doc for count on GroupedData ## What changes were proposed in this pull request? fix code doc ## How was this patch tested? manual shivaram Author: Felix Cheung Closes #13782 from felixcheung/rcountdoc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aee1420e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aee1420e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aee1420e Branch: refs/heads/master Commit: aee1420eca64dfc145f31b8c653388fafc5ccd8f Parents: 46d98e0 Author: Felix Cheung Authored: Mon Jun 20 12:31:00 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 12:31:00 2016 -0700 -- R/pkg/R/group.R | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/aee1420e/R/pkg/R/group.R -- diff --git a/R/pkg/R/group.R b/R/pkg/R/group.R index eba083f..65b9e84 100644 --- a/R/pkg/R/group.R +++ b/R/pkg/R/group.R @@ -58,7 +58,7 @@ setMethod("show", "GroupedData", #' #' @param x a GroupedData #' @return a SparkDataFrame -#' @rdname agg +#' @rdname count #' @export #' @examples #' \dontrun{ @@ -83,6 +83,7 @@ setMethod("count", #' @rdname summarize #' @name agg #' @family agg_funcs +#' @export #' @examples #' \dontrun{ #' df2 <- agg(df, age = "sum") # new column name will be created as 'SUM(age#0)' @@ -160,6 +161,7 @@ createMethods() #' @return a SparkDataFrame #' @rdname gapply #' @name gapply +#' @export #' @examples #' \dontrun{ #' Computes the arithmetic mean of the second column by grouping - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16028][SPARKR] spark.lapply can work with active context
Repository: spark Updated Branches: refs/heads/branch-2.0 ead872e49 -> d2c94e6a4 [SPARK-16028][SPARKR] spark.lapply can work with active context ## What changes were proposed in this pull request? spark.lapply and setLogLevel ## How was this patch tested? unit test shivaram thunterdb Author: Felix Cheung Closes #13752 from felixcheung/rlapply. (cherry picked from commit 46d98e0a1f40a4c6ae92253c5c498a3a924497fc) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d2c94e6a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d2c94e6a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d2c94e6a Branch: refs/heads/branch-2.0 Commit: d2c94e6a45090cf545fe1e243f3dfde5ed87b4d0 Parents: ead872e Author: Felix Cheung Authored: Mon Jun 20 12:08:42 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 12:08:49 2016 -0700 -- R/pkg/R/context.R| 20 +--- R/pkg/inst/tests/testthat/test_context.R | 6 +++--- 2 files changed, 16 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d2c94e6a/R/pkg/R/context.R -- diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index 5c88603..968a9d2 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -252,17 +252,20 @@ setCheckpointDir <- function(sc, dirName) { #' } #' #' @rdname spark.lapply -#' @param sc Spark Context to use #' @param list the list of elements #' @param func a function that takes one argument. #' @return a list of results (the exact type being determined by the function) #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' doubled <- spark.lapply(sc, 1:10, function(x){2 * x}) +#' sparkR.session() +#' doubled <- spark.lapply(1:10, function(x){2 * x}) #'} -spark.lapply <- function(sc, list, func) { +spark.lapply <- function(list, func) { + if (!exists(".sparkRjsc", envir = .sparkREnv)) { +stop("SparkR has not been initialized. Please call sparkR.session()") + } + sc <- get(".sparkRjsc", envir = .sparkREnv) rdd <- parallelize(sc, list, length(list)) results <- map(rdd, func) local <- collect(results) @@ -274,14 +277,17 @@ spark.lapply <- function(sc, list, func) { #' Set new log level: "ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN" #' #' @rdname setLogLevel -#' @param sc Spark Context to use #' @param level New log level #' @export #' @examples #'\dontrun{ -#' setLogLevel(sc, "ERROR") +#' setLogLevel("ERROR") #'} -setLogLevel <- function(sc, level) { +setLogLevel <- function(level) { + if (!exists(".sparkRjsc", envir = .sparkREnv)) { +stop("SparkR has not been initialized. 
Please call sparkR.session()") + } + sc <- get(".sparkRjsc", envir = .sparkREnv) callJMethod(sc, "setLogLevel", level) } http://git-wip-us.apache.org/repos/asf/spark/blob/d2c94e6a/R/pkg/inst/tests/testthat/test_context.R -- diff --git a/R/pkg/inst/tests/testthat/test_context.R b/R/pkg/inst/tests/testthat/test_context.R index f123187..b149818 100644 --- a/R/pkg/inst/tests/testthat/test_context.R +++ b/R/pkg/inst/tests/testthat/test_context.R @@ -107,8 +107,8 @@ test_that("job group functions can be called", { }) test_that("utility function can be called", { - sc <- sparkR.sparkContext() - setLogLevel(sc, "ERROR") + sparkR.sparkContext() + setLogLevel("ERROR") sparkR.session.stop() }) @@ -161,7 +161,7 @@ test_that("sparkJars sparkPackages as comma-separated strings", { test_that("spark.lapply should perform simple transforms", { sc <- sparkR.sparkContext() - doubled <- spark.lapply(sc, 1:10, function(x) { 2 * x }) + doubled <- spark.lapply(1:10, function(x) { 2 * x }) expect_equal(doubled, as.list(2 * 1:10)) sparkR.session.stop() }) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
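The updated calling convention shown in the diff, as a sketch: no SparkContext handle is passed, but an active session is required first.

```r
sparkR.session()
setLogLevel("ERROR")                                   # no longer takes the SparkContext
doubled <- spark.lapply(1:10, function(x) { 2 * x })   # list(2, 4, ..., 20)
```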
spark git commit: [SPARK-16028][SPARKR] spark.lapply can work with active context
Repository: spark Updated Branches: refs/heads/master c44bf137c -> 46d98e0a1 [SPARK-16028][SPARKR] spark.lapply can work with active context ## What changes were proposed in this pull request? spark.lapply and setLogLevel ## How was this patch tested? unit test shivaram thunterdb Author: Felix Cheung Closes #13752 from felixcheung/rlapply. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/46d98e0a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/46d98e0a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/46d98e0a Branch: refs/heads/master Commit: 46d98e0a1f40a4c6ae92253c5c498a3a924497fc Parents: c44bf13 Author: Felix Cheung Authored: Mon Jun 20 12:08:42 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 12:08:42 2016 -0700 -- R/pkg/R/context.R| 20 +--- R/pkg/inst/tests/testthat/test_context.R | 6 +++--- 2 files changed, 16 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/46d98e0a/R/pkg/R/context.R -- diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index 5c88603..968a9d2 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -252,17 +252,20 @@ setCheckpointDir <- function(sc, dirName) { #' } #' #' @rdname spark.lapply -#' @param sc Spark Context to use #' @param list the list of elements #' @param func a function that takes one argument. #' @return a list of results (the exact type being determined by the function) #' @export #' @examples #'\dontrun{ -#' sc <- sparkR.init() -#' doubled <- spark.lapply(sc, 1:10, function(x){2 * x}) +#' sparkR.session() +#' doubled <- spark.lapply(1:10, function(x){2 * x}) #'} -spark.lapply <- function(sc, list, func) { +spark.lapply <- function(list, func) { + if (!exists(".sparkRjsc", envir = .sparkREnv)) { +stop("SparkR has not been initialized. Please call sparkR.session()") + } + sc <- get(".sparkRjsc", envir = .sparkREnv) rdd <- parallelize(sc, list, length(list)) results <- map(rdd, func) local <- collect(results) @@ -274,14 +277,17 @@ spark.lapply <- function(sc, list, func) { #' Set new log level: "ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN" #' #' @rdname setLogLevel -#' @param sc Spark Context to use #' @param level New log level #' @export #' @examples #'\dontrun{ -#' setLogLevel(sc, "ERROR") +#' setLogLevel("ERROR") #'} -setLogLevel <- function(sc, level) { +setLogLevel <- function(level) { + if (!exists(".sparkRjsc", envir = .sparkREnv)) { +stop("SparkR has not been initialized. 
Please call sparkR.session()") + } + sc <- get(".sparkRjsc", envir = .sparkREnv) callJMethod(sc, "setLogLevel", level) } http://git-wip-us.apache.org/repos/asf/spark/blob/46d98e0a/R/pkg/inst/tests/testthat/test_context.R -- diff --git a/R/pkg/inst/tests/testthat/test_context.R b/R/pkg/inst/tests/testthat/test_context.R index f123187..b149818 100644 --- a/R/pkg/inst/tests/testthat/test_context.R +++ b/R/pkg/inst/tests/testthat/test_context.R @@ -107,8 +107,8 @@ test_that("job group functions can be called", { }) test_that("utility function can be called", { - sc <- sparkR.sparkContext() - setLogLevel(sc, "ERROR") + sparkR.sparkContext() + setLogLevel("ERROR") sparkR.session.stop() }) @@ -161,7 +161,7 @@ test_that("sparkJars sparkPackages as comma-separated strings", { test_that("spark.lapply should perform simple transforms", { sc <- sparkR.sparkContext() - doubled <- spark.lapply(sc, 1:10, function(x) { 2 * x }) + doubled <- spark.lapply(1:10, function(x) { 2 * x }) expect_equal(doubled, as.list(2 * 1:10)) sparkR.session.stop() }) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16051][R] Add `read.orc/write.orc` to SparkR
Repository: spark Updated Branches: refs/heads/master 36e812d4b -> c44bf137c [SPARK-16051][R] Add `read.orc/write.orc` to SparkR ## What changes were proposed in this pull request? This issue adds `read.orc/write.orc` to SparkR for API parity. ## How was this patch tested? Pass the Jenkins tests (with new testcases). Author: Dongjoon Hyun Closes #13763 from dongjoon-hyun/SPARK-16051. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c44bf137 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c44bf137 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c44bf137 Branch: refs/heads/master Commit: c44bf137c7ca649e0c504229eb3e6ff7955e9a53 Parents: 36e812d Author: Dongjoon Hyun Authored: Mon Jun 20 11:30:26 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 11:30:26 2016 -0700 -- R/pkg/NAMESPACE | 2 ++ R/pkg/R/DataFrame.R | 27 ++ R/pkg/R/SQLContext.R | 21 +++- R/pkg/R/generics.R| 4 R/pkg/inst/tests/testthat/test_sparkSQL.R | 21 5 files changed, 74 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c44bf137/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index cc129a7..aaeab66 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -117,6 +117,7 @@ exportMethods("arrange", "write.df", "write.jdbc", "write.json", + "write.orc", "write.parquet", "write.text", "write.ml") @@ -306,6 +307,7 @@ export("as.DataFrame", "read.df", "read.jdbc", "read.json", + "read.orc", "read.parquet", "read.text", "spark.lapply", http://git-wip-us.apache.org/repos/asf/spark/blob/c44bf137/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index ea091c8..f3a3eff 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -701,6 +701,33 @@ setMethod("write.json", invisible(callJMethod(write, "json", path)) }) +#' Save the contents of SparkDataFrame as an ORC file, preserving the schema. +#' +#' Save the contents of a SparkDataFrame as an ORC file, preserving the schema. Files written out +#' with this method can be read back in as a SparkDataFrame using read.orc(). +#' +#' @param x A SparkDataFrame +#' @param path The directory where the file is saved +#' +#' @family SparkDataFrame functions +#' @rdname write.orc +#' @name write.orc +#' @export +#' @examples +#'\dontrun{ +#' sparkR.session() +#' path <- "path/to/file.json" +#' df <- read.json(path) +#' write.orc(df, "/tmp/sparkr-tmp1/") +#' } +#' @note write.orc since 2.0.0 +setMethod("write.orc", + signature(x = "SparkDataFrame", path = "character"), + function(x, path) { +write <- callJMethod(x@sdf, "write") +invisible(callJMethod(write, "orc", path)) + }) + #' Save the contents of SparkDataFrame as a Parquet file, preserving the schema. #' #' Save the contents of a SparkDataFrame as a Parquet file, preserving the schema. Files written out http://git-wip-us.apache.org/repos/asf/spark/blob/c44bf137/R/pkg/R/SQLContext.R -- diff --git a/R/pkg/R/SQLContext.R b/R/pkg/R/SQLContext.R index b0ccc42..b7e1c06 100644 --- a/R/pkg/R/SQLContext.R +++ b/R/pkg/R/SQLContext.R @@ -330,6 +330,25 @@ jsonRDD <- function(sqlContext, rdd, schema = NULL, samplingRatio = 1.0) { } } +#' Create a SparkDataFrame from an ORC file. +#' +#' Loads an ORC file, returning the result as a SparkDataFrame. +#' +#' @param path Path of file to read. 
+#' @return SparkDataFrame +#' @rdname read.orc +#' @export +#' @name read.orc +#' @note read.orc since 2.0.0 +read.orc <- function(path) { + sparkSession <- getSparkSession() + # Allow the user to have a more flexible definiton of the ORC file path + path <- suppressWarnings(normalizePath(path)) + read <- callJMethod(sparkSession, "read") + sdf <- callJMethod(read, "orc", path) + dataFrame(sdf) +} + #' Create a SparkDataFrame from a Parquet file. #' #' Loads a Parquet file, returning the result as a SparkDataFrame. @@ -343,7 +362,7 @@ jsonRDD <- function(sqlContext, rdd, schema = NULL, samplingRatio = 1.0) { read.parquet.default <- function(path) { sparkSession <- getSparkSession() - # Allow the user to have a more flexible definiton of the text file path + # Allow the user to have a more flexible definiton of the Parquet file path pa
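Together, the new reader and writer give SparkR an ORC round trip alongside JSON and Parquet. A hedged sketch of typical usage (the output path is a placeholder; the JSON sample file ships with the Spark distribution):

```r
sparkR.session()

people <- read.json("examples/src/main/resources/people.json")
write.orc(people, "/tmp/people-orc")      # writes a directory of ORC part files
people2 <- read.orc("/tmp/people-orc")    # loads that directory back as a SparkDataFrame

count(people2) == count(people)           # should be TRUE
```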
spark git commit: [SPARK-16051][R] Add `read.orc/write.orc` to SparkR
Repository: spark Updated Branches: refs/heads/branch-2.0 5b22e34e9 -> ead872e49 [SPARK-16051][R] Add `read.orc/write.orc` to SparkR ## What changes were proposed in this pull request? This issue adds `read.orc/write.orc` to SparkR for API parity. ## How was this patch tested? Pass the Jenkins tests (with new testcases). Author: Dongjoon Hyun Closes #13763 from dongjoon-hyun/SPARK-16051. (cherry picked from commit c44bf137c7ca649e0c504229eb3e6ff7955e9a53) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ead872e4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ead872e4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ead872e4 Branch: refs/heads/branch-2.0 Commit: ead872e4996ad0c0b02debd1ab829ff67b79abfb Parents: 5b22e34 Author: Dongjoon Hyun Authored: Mon Jun 20 11:30:26 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 11:30:36 2016 -0700 -- R/pkg/NAMESPACE | 2 ++ R/pkg/R/DataFrame.R | 27 ++ R/pkg/R/SQLContext.R | 21 +++- R/pkg/R/generics.R| 4 R/pkg/inst/tests/testthat/test_sparkSQL.R | 21 5 files changed, 74 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ead872e4/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index cc129a7..aaeab66 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -117,6 +117,7 @@ exportMethods("arrange", "write.df", "write.jdbc", "write.json", + "write.orc", "write.parquet", "write.text", "write.ml") @@ -306,6 +307,7 @@ export("as.DataFrame", "read.df", "read.jdbc", "read.json", + "read.orc", "read.parquet", "read.text", "spark.lapply", http://git-wip-us.apache.org/repos/asf/spark/blob/ead872e4/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index ea091c8..f3a3eff 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -701,6 +701,33 @@ setMethod("write.json", invisible(callJMethod(write, "json", path)) }) +#' Save the contents of SparkDataFrame as an ORC file, preserving the schema. +#' +#' Save the contents of a SparkDataFrame as an ORC file, preserving the schema. Files written out +#' with this method can be read back in as a SparkDataFrame using read.orc(). +#' +#' @param x A SparkDataFrame +#' @param path The directory where the file is saved +#' +#' @family SparkDataFrame functions +#' @rdname write.orc +#' @name write.orc +#' @export +#' @examples +#'\dontrun{ +#' sparkR.session() +#' path <- "path/to/file.json" +#' df <- read.json(path) +#' write.orc(df, "/tmp/sparkr-tmp1/") +#' } +#' @note write.orc since 2.0.0 +setMethod("write.orc", + signature(x = "SparkDataFrame", path = "character"), + function(x, path) { +write <- callJMethod(x@sdf, "write") +invisible(callJMethod(write, "orc", path)) + }) + #' Save the contents of SparkDataFrame as a Parquet file, preserving the schema. #' #' Save the contents of a SparkDataFrame as a Parquet file, preserving the schema. Files written out http://git-wip-us.apache.org/repos/asf/spark/blob/ead872e4/R/pkg/R/SQLContext.R -- diff --git a/R/pkg/R/SQLContext.R b/R/pkg/R/SQLContext.R index b0ccc42..b7e1c06 100644 --- a/R/pkg/R/SQLContext.R +++ b/R/pkg/R/SQLContext.R @@ -330,6 +330,25 @@ jsonRDD <- function(sqlContext, rdd, schema = NULL, samplingRatio = 1.0) { } } +#' Create a SparkDataFrame from an ORC file. +#' +#' Loads an ORC file, returning the result as a SparkDataFrame. +#' +#' @param path Path of file to read. 
+#' @return SparkDataFrame +#' @rdname read.orc +#' @export +#' @name read.orc +#' @note read.orc since 2.0.0 +read.orc <- function(path) { + sparkSession <- getSparkSession() + # Allow the user to have a more flexible definiton of the ORC file path + path <- suppressWarnings(normalizePath(path)) + read <- callJMethod(sparkSession, "read") + sdf <- callJMethod(read, "orc", path) + dataFrame(sdf) +} + #' Create a SparkDataFrame from a Parquet file. #' #' Loads a Parquet file, returning the result as a SparkDataFrame. @@ -343,7 +362,7 @@ jsonRDD <- function(sqlContext, rdd, schema = NULL, samplingRatio = 1.0) { read.parquet.default <- function(path) { sparkSession <- getSparkSession() - # Allow the user to have a more flexible
spark git commit: [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable
Repository: spark Updated Branches: refs/heads/branch-2.0 bb80d1c24 -> 5b22e34e9 [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable ## What changes were proposed in this pull request? Add dropTempView and deprecate dropTempTable ## How was this patch tested? unit tests shivaram liancheng Author: Felix Cheung Closes #13753 from felixcheung/rdroptempview. (cherry picked from commit 36e812d4b695566437c6bac991ef06a0f81fb1c5) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5b22e34e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5b22e34e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5b22e34e Branch: refs/heads/branch-2.0 Commit: 5b22e34e96f7795a0e8d547eba2229b60f999fa5 Parents: bb80d1c Author: Felix Cheung Authored: Mon Jun 20 11:24:41 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 11:24:48 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/SQLContext.R | 39 ++ R/pkg/inst/tests/testthat/test_sparkSQL.R | 14 - 3 files changed, 41 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5b22e34e/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 0cfe190..cc129a7 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -299,6 +299,7 @@ export("as.DataFrame", "createDataFrame", "createExternalTable", "dropTempTable", + "dropTempView", "jsonFile", "loadDF", "parquetFile", http://git-wip-us.apache.org/repos/asf/spark/blob/5b22e34e/R/pkg/R/SQLContext.R -- diff --git a/R/pkg/R/SQLContext.R b/R/pkg/R/SQLContext.R index 3232241..b0ccc42 100644 --- a/R/pkg/R/SQLContext.R +++ b/R/pkg/R/SQLContext.R @@ -599,13 +599,14 @@ clearCache <- function() { dispatchFunc("clearCache()") } -#' Drop Temporary Table +#' (Deprecated) Drop Temporary Table #' #' Drops the temporary table with the given table name in the catalog. #' If the table has been cached/persisted before, it's also unpersisted. #' #' @param tableName The name of the SparkSQL table to be dropped. -#' @rdname dropTempTable +#' @seealso \link{dropTempView} +#' @rdname dropTempTable-deprecated #' @export #' @examples #' \dontrun{ @@ -619,16 +620,42 @@ clearCache <- function() { #' @method dropTempTable default dropTempTable.default <- function(tableName) { - sparkSession <- getSparkSession() if (class(tableName) != "character") { stop("tableName must be a string.") } - catalog <- callJMethod(sparkSession, "catalog") - callJMethod(catalog, "dropTempView", tableName) + dropTempView(tableName) } dropTempTable <- function(x, ...) { - dispatchFunc("dropTempTable(tableName)", x, ...) + .Deprecated("dropTempView") + dispatchFunc("dropTempView(viewName)", x, ...) +} + +#' Drops the temporary view with the given view name in the catalog. +#' +#' Drops the temporary view with the given view name in the catalog. +#' If the view has been cached before, then it will also be uncached. +#' +#' @param viewName the name of the view to be dropped. 
+#' @rdname dropTempView +#' @name dropTempView +#' @export +#' @examples +#' \dontrun{ +#' sparkR.session() +#' df <- read.df(path, "parquet") +#' createOrReplaceTempView(df, "table") +#' dropTempView("table") +#' } +#' @note since 2.0.0 + +dropTempView <- function(viewName) { + sparkSession <- getSparkSession() + if (class(viewName) != "character") { +stop("viewName must be a string.") + } + catalog <- callJMethod(sparkSession, "catalog") + callJMethod(catalog, "dropTempView", viewName) } #' Load a SparkDataFrame http://git-wip-us.apache.org/repos/asf/spark/blob/5b22e34e/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index c5c5a06..ceba0d1 100644 --- a/R/pkg/inst/tests/testthat/test_sparkSQL.R +++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R @@ -472,8 +472,8 @@ test_that("test tableNames and tables", { suppressWarnings(registerTempTable(df, "table2")) tables <- tables() expect_equal(count(tables), 2) - dropTempTable("table1") - dropTempTable("table2") + suppressWarnings(dropTempTable("table1")) + dropTempView("table2") tables <- tables() expect_equal(count(tables), 0) @@ -486,7 +486,7 @@ test_that( newdf <- sql("SELECT * FROM table1 where name = 'Michael'") expect_is(newdf, "SparkDataFrame") expect_equal(count(newdf), 1) - dropTempTable("table1") + dropTempView("table1")
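For scripts being migrated, the replacement is mechanical: register a temporary view as before, then drop it with `dropTempView`; `dropTempTable` still works but now emits a deprecation warning. A small sketch, assuming an active SparkR 2.0 session and throwaway data:

```r
df <- createDataFrame(data.frame(name = c("Michael", "Andy"), age = c(NA, 30)))
createOrReplaceTempView(df, "people")

count(sql("SELECT * FROM people"))           # 2

dropTempView("people")                       # new API
# suppressWarnings(dropTempTable("people"))  # deprecated spelling, same effect
```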
spark git commit: [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable
Repository: spark Updated Branches: refs/heads/master 961342489 -> 36e812d4b [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable ## What changes were proposed in this pull request? Add dropTempView and deprecate dropTempTable ## How was this patch tested? unit tests shivaram liancheng Author: Felix Cheung Closes #13753 from felixcheung/rdroptempview. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/36e812d4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/36e812d4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/36e812d4 Branch: refs/heads/master Commit: 36e812d4b695566437c6bac991ef06a0f81fb1c5 Parents: 9613424 Author: Felix Cheung Authored: Mon Jun 20 11:24:41 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 11:24:41 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/SQLContext.R | 39 ++ R/pkg/inst/tests/testthat/test_sparkSQL.R | 14 - 3 files changed, 41 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/36e812d4/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 0cfe190..cc129a7 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -299,6 +299,7 @@ export("as.DataFrame", "createDataFrame", "createExternalTable", "dropTempTable", + "dropTempView", "jsonFile", "loadDF", "parquetFile", http://git-wip-us.apache.org/repos/asf/spark/blob/36e812d4/R/pkg/R/SQLContext.R -- diff --git a/R/pkg/R/SQLContext.R b/R/pkg/R/SQLContext.R index 3232241..b0ccc42 100644 --- a/R/pkg/R/SQLContext.R +++ b/R/pkg/R/SQLContext.R @@ -599,13 +599,14 @@ clearCache <- function() { dispatchFunc("clearCache()") } -#' Drop Temporary Table +#' (Deprecated) Drop Temporary Table #' #' Drops the temporary table with the given table name in the catalog. #' If the table has been cached/persisted before, it's also unpersisted. #' #' @param tableName The name of the SparkSQL table to be dropped. -#' @rdname dropTempTable +#' @seealso \link{dropTempView} +#' @rdname dropTempTable-deprecated #' @export #' @examples #' \dontrun{ @@ -619,16 +620,42 @@ clearCache <- function() { #' @method dropTempTable default dropTempTable.default <- function(tableName) { - sparkSession <- getSparkSession() if (class(tableName) != "character") { stop("tableName must be a string.") } - catalog <- callJMethod(sparkSession, "catalog") - callJMethod(catalog, "dropTempView", tableName) + dropTempView(tableName) } dropTempTable <- function(x, ...) { - dispatchFunc("dropTempTable(tableName)", x, ...) + .Deprecated("dropTempView") + dispatchFunc("dropTempView(viewName)", x, ...) +} + +#' Drops the temporary view with the given view name in the catalog. +#' +#' Drops the temporary view with the given view name in the catalog. +#' If the view has been cached before, then it will also be uncached. +#' +#' @param viewName the name of the view to be dropped. 
+#' @rdname dropTempView +#' @name dropTempView +#' @export +#' @examples +#' \dontrun{ +#' sparkR.session() +#' df <- read.df(path, "parquet") +#' createOrReplaceTempView(df, "table") +#' dropTempView("table") +#' } +#' @note since 2.0.0 + +dropTempView <- function(viewName) { + sparkSession <- getSparkSession() + if (class(viewName) != "character") { +stop("viewName must be a string.") + } + catalog <- callJMethod(sparkSession, "catalog") + callJMethod(catalog, "dropTempView", viewName) } #' Load a SparkDataFrame http://git-wip-us.apache.org/repos/asf/spark/blob/36e812d4/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index c5c5a06..ceba0d1 100644 --- a/R/pkg/inst/tests/testthat/test_sparkSQL.R +++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R @@ -472,8 +472,8 @@ test_that("test tableNames and tables", { suppressWarnings(registerTempTable(df, "table2")) tables <- tables() expect_equal(count(tables), 2) - dropTempTable("table1") - dropTempTable("table2") + suppressWarnings(dropTempTable("table1")) + dropTempView("table2") tables <- tables() expect_equal(count(tables), 0) @@ -486,7 +486,7 @@ test_that( newdf <- sql("SELECT * FROM table1 where name = 'Michael'") expect_is(newdf, "SparkDataFrame") expect_equal(count(newdf), 1) - dropTempTable("table1") + dropTempView("table1") }) test_that("test cache, uncache and clearCache", { @@ -495,7 +495,7 @@ test_that("test cache, uncache and clea
spark git commit: [SPARK-16059][R] Add `monotonically_increasing_id` function in SparkR
Repository: spark Updated Branches: refs/heads/branch-2.0 363db9f8b -> bb80d1c24 [SPARK-16059][R] Add `monotonically_increasing_id` function in SparkR ## What changes were proposed in this pull request? This PR adds `monotonically_increasing_id` column function in SparkR for API parity. After this PR, SparkR supports the followings. ```r > df <- read.json("examples/src/main/resources/people.json") > collect(select(df, monotonically_increasing_id(), df$name, df$age)) monotonically_increasing_id()name age 1 0 Michael NA 2 1Andy 30 3 2 Justin 19 ``` ## How was this patch tested? Pass the Jenkins tests (with added testcase). Author: Dongjoon Hyun Closes #13774 from dongjoon-hyun/SPARK-16059. (cherry picked from commit 9613424898fd2a586156bc4eb48e255749774f20) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bb80d1c2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bb80d1c2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bb80d1c2 Branch: refs/heads/branch-2.0 Commit: bb80d1c24a633ceb4ad63b1fa8c02c66d79b2540 Parents: 363db9f Author: Dongjoon Hyun Authored: Mon Jun 20 11:12:41 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 11:12:51 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 27 ++ R/pkg/R/generics.R| 5 + R/pkg/inst/tests/testthat/test_sparkSQL.R | 2 +- 4 files changed, 34 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bb80d1c2/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 82e56ca..0cfe190 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -218,6 +218,7 @@ exportMethods("%in%", "mean", "min", "minute", + "monotonically_increasing_id", "month", "months_between", "n", http://git-wip-us.apache.org/repos/asf/spark/blob/bb80d1c2/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index a779127..0fb38bc 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -911,6 +911,33 @@ setMethod("minute", column(jc) }) +#' monotonically_increasing_id +#' +#' Return a column that generates monotonically increasing 64-bit integers. +#' +#' The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. +#' The current implementation puts the partition ID in the upper 31 bits, and the record number +#' within each partition in the lower 33 bits. The assumption is that the SparkDataFrame has +#' less than 1 billion partitions, and each partition has less than 8 billion records. +#' +#' As an example, consider a SparkDataFrame with two partitions, each with 3 records. +#' This expression would return the following IDs: +#' 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594. +#' +#' This is equivalent to the MONOTONICALLY_INCREASING_ID function in SQL. +#' +#' @rdname monotonically_increasing_id +#' @name monotonically_increasing_id +#' @family misc_funcs +#' @export +#' @examples \dontrun{select(df, monotonically_increasing_id())} +setMethod("monotonically_increasing_id", + signature(x = "missing"), + function() { +jc <- callJStatic("org.apache.spark.sql.functions", "monotonically_increasing_id") +column(jc) + }) + #' month #' #' Extracts the month as an integer from a given date/timestamp/string. 
http://git-wip-us.apache.org/repos/asf/spark/blob/bb80d1c2/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 6e754af..37d0556 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -993,6 +993,11 @@ setGeneric("md5", function(x) { standardGeneric("md5") }) #' @export setGeneric("minute", function(x) { standardGeneric("minute") }) +#' @rdname monotonically_increasing_id +#' @export +setGeneric("monotonically_increasing_id", + function(x) { standardGeneric("monotonically_increasing_id") }) + #' @rdname month #' @export setGeneric("month", function(x) { standardGeneric("month") }) http://git-wip-us.apache.org/repos/asf/spark/blob/bb80d1c2/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index fcc2ab3..c5c
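The ID layout described in the roxygen above can be sanity-checked with plain arithmetic: an ID is the partition index shifted into the upper bits plus the record's position within that partition. A tiny base R illustration (no Spark required; the helper name is made up):

```r
# id = partition * 2^33 + record, per the documented bit layout
id_for <- function(partition, record) partition * 2^33 + record

id_for(0, 0:2)   # 0 1 2                                (first partition)
id_for(1, 0:2)   # 8589934592 8589934593 8589934594     (second partition)
```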
spark git commit: [SPARK-16059][R] Add `monotonically_increasing_id` function in SparkR
Repository: spark Updated Branches: refs/heads/master 5cfabec87 -> 961342489 [SPARK-16059][R] Add `monotonically_increasing_id` function in SparkR ## What changes were proposed in this pull request? This PR adds `monotonically_increasing_id` column function in SparkR for API parity. After this PR, SparkR supports the followings. ```r > df <- read.json("examples/src/main/resources/people.json") > collect(select(df, monotonically_increasing_id(), df$name, df$age)) monotonically_increasing_id()name age 1 0 Michael NA 2 1Andy 30 3 2 Justin 19 ``` ## How was this patch tested? Pass the Jenkins tests (with added testcase). Author: Dongjoon Hyun Closes #13774 from dongjoon-hyun/SPARK-16059. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/96134248 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/96134248 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/96134248 Branch: refs/heads/master Commit: 9613424898fd2a586156bc4eb48e255749774f20 Parents: 5cfabec Author: Dongjoon Hyun Authored: Mon Jun 20 11:12:41 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jun 20 11:12:41 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 27 ++ R/pkg/R/generics.R| 5 + R/pkg/inst/tests/testthat/test_sparkSQL.R | 2 +- 4 files changed, 34 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/96134248/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 82e56ca..0cfe190 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -218,6 +218,7 @@ exportMethods("%in%", "mean", "min", "minute", + "monotonically_increasing_id", "month", "months_between", "n", http://git-wip-us.apache.org/repos/asf/spark/blob/96134248/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index a779127..0fb38bc 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -911,6 +911,33 @@ setMethod("minute", column(jc) }) +#' monotonically_increasing_id +#' +#' Return a column that generates monotonically increasing 64-bit integers. +#' +#' The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. +#' The current implementation puts the partition ID in the upper 31 bits, and the record number +#' within each partition in the lower 33 bits. The assumption is that the SparkDataFrame has +#' less than 1 billion partitions, and each partition has less than 8 billion records. +#' +#' As an example, consider a SparkDataFrame with two partitions, each with 3 records. +#' This expression would return the following IDs: +#' 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594. +#' +#' This is equivalent to the MONOTONICALLY_INCREASING_ID function in SQL. +#' +#' @rdname monotonically_increasing_id +#' @name monotonically_increasing_id +#' @family misc_funcs +#' @export +#' @examples \dontrun{select(df, monotonically_increasing_id())} +setMethod("monotonically_increasing_id", + signature(x = "missing"), + function() { +jc <- callJStatic("org.apache.spark.sql.functions", "monotonically_increasing_id") +column(jc) + }) + #' month #' #' Extracts the month as an integer from a given date/timestamp/string. 
http://git-wip-us.apache.org/repos/asf/spark/blob/96134248/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 6e754af..37d0556 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -993,6 +993,11 @@ setGeneric("md5", function(x) { standardGeneric("md5") }) #' @export setGeneric("minute", function(x) { standardGeneric("minute") }) +#' @rdname monotonically_increasing_id +#' @export +setGeneric("monotonically_increasing_id", + function(x) { standardGeneric("monotonically_increasing_id") }) + #' @rdname month #' @export setGeneric("month", function(x) { standardGeneric("month") }) http://git-wip-us.apache.org/repos/asf/spark/blob/96134248/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index fcc2ab3..c5c5a06 100644 --- a/R/pkg/inst/tests/testthat/test_sparkSQL.R +++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R @@ -1047
spark git commit: [SPARK-16050][TESTS] Remove the flaky test: ConsoleSinkSuite
Repository: spark Updated Branches: refs/heads/master 905f774b7 -> 5cfabec87 [SPARK-16050][TESTS] Remove the flaky test: ConsoleSinkSuite ## What changes were proposed in this pull request? ConsoleSinkSuite just collects content from stdout and compare them with the expected string. However, because Spark may not stop some background threads at once, there is a race condition that other threads are outputting logs to **stdout** while ConsoleSinkSuite is running. Then it will make ConsoleSinkSuite fail. Therefore, I just deleted `ConsoleSinkSuite`. If we want to test ConsoleSinkSuite in future, we should refactoring ConsoleSink to make it testable instead of depending on stdout. Therefore, this test is useless and I just delete it. ## How was this patch tested? Just removed a flaky test. Author: Shixiong Zhu Closes #13776 from zsxwing/SPARK-16050. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5cfabec8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5cfabec8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5cfabec8 Branch: refs/heads/master Commit: 5cfabec8724714b897d6e23e670c39e58f554ea2 Parents: 905f774 Author: Shixiong Zhu Authored: Mon Jun 20 10:35:37 2016 -0700 Committer: Michael Armbrust Committed: Mon Jun 20 10:35:37 2016 -0700 -- .../execution/streaming/ConsoleSinkSuite.scala | 99 1 file changed, 99 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5cfabec8/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala deleted file mode 100644 index e853d8c..000 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala +++ /dev/null @@ -1,99 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.spark.sql.execution.streaming - -import java.io.{ByteArrayOutputStream, PrintStream} -import java.nio.charset.StandardCharsets.UTF_8 - -import org.scalatest.BeforeAndAfter - -import org.apache.spark.sql.streaming.StreamTest - -class ConsoleSinkSuite extends StreamTest with BeforeAndAfter { - - import testImplicits._ - - after { -sqlContext.streams.active.foreach(_.stop()) - } - - test("SPARK-16020 Complete mode aggregation with console sink") { -withTempDir { checkpointLocation => - val origOut = System.out - val stdout = new ByteArrayOutputStream() - try { -// Hook Java System.out.println -System.setOut(new PrintStream(stdout)) -// Hook Scala println -Console.withOut(stdout) { - val input = MemoryStream[String] - val df = input.toDF().groupBy("value").count() - val query = df.writeStream -.format("console") -.outputMode("complete") -.option("checkpointLocation", checkpointLocation.getAbsolutePath) -.start() - input.addData("a") - query.processAllAvailable() - input.addData("a", "b") - query.processAllAvailable() - input.addData("a", "b", "c") - query.processAllAvailable() - query.stop() -} -System.out.flush() - } finally { -System.setOut(origOut) - } - - val expected = """--- -|Batch: 0 -|--- -|+-+-+ -||value|count| -|+-+-+ -||a|1| -|+-+-+ -| -|--- -|Batch: 1 -|--- -|+-+-+ -||value|count| -|+-+-+ -||a|2| -||b|1| -|+-+-+ -
spark git commit: [SPARK-16050][TESTS] Remove the flaky test: ConsoleSinkSuite
Repository: spark Updated Branches: refs/heads/branch-2.0 0b0b5fe54 -> 363db9f8b [SPARK-16050][TESTS] Remove the flaky test: ConsoleSinkSuite ## What changes were proposed in this pull request? ConsoleSinkSuite just collects content from stdout and compare them with the expected string. However, because Spark may not stop some background threads at once, there is a race condition that other threads are outputting logs to **stdout** while ConsoleSinkSuite is running. Then it will make ConsoleSinkSuite fail. Therefore, I just deleted `ConsoleSinkSuite`. If we want to test ConsoleSinkSuite in future, we should refactoring ConsoleSink to make it testable instead of depending on stdout. Therefore, this test is useless and I just delete it. ## How was this patch tested? Just removed a flaky test. Author: Shixiong Zhu Closes #13776 from zsxwing/SPARK-16050. (cherry picked from commit 5cfabec8724714b897d6e23e670c39e58f554ea2) Signed-off-by: Michael Armbrust Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/363db9f8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/363db9f8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/363db9f8 Branch: refs/heads/branch-2.0 Commit: 363db9f8be53773238854ab16c3459ba46a6961b Parents: 0b0b5fe Author: Shixiong Zhu Authored: Mon Jun 20 10:35:37 2016 -0700 Committer: Michael Armbrust Committed: Mon Jun 20 10:35:49 2016 -0700 -- .../execution/streaming/ConsoleSinkSuite.scala | 99 1 file changed, 99 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/363db9f8/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala deleted file mode 100644 index e853d8c..000 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ConsoleSinkSuite.scala +++ /dev/null @@ -1,99 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.spark.sql.execution.streaming - -import java.io.{ByteArrayOutputStream, PrintStream} -import java.nio.charset.StandardCharsets.UTF_8 - -import org.scalatest.BeforeAndAfter - -import org.apache.spark.sql.streaming.StreamTest - -class ConsoleSinkSuite extends StreamTest with BeforeAndAfter { - - import testImplicits._ - - after { -sqlContext.streams.active.foreach(_.stop()) - } - - test("SPARK-16020 Complete mode aggregation with console sink") { -withTempDir { checkpointLocation => - val origOut = System.out - val stdout = new ByteArrayOutputStream() - try { -// Hook Java System.out.println -System.setOut(new PrintStream(stdout)) -// Hook Scala println -Console.withOut(stdout) { - val input = MemoryStream[String] - val df = input.toDF().groupBy("value").count() - val query = df.writeStream -.format("console") -.outputMode("complete") -.option("checkpointLocation", checkpointLocation.getAbsolutePath) -.start() - input.addData("a") - query.processAllAvailable() - input.addData("a", "b") - query.processAllAvailable() - input.addData("a", "b", "c") - query.processAllAvailable() - query.stop() -} -System.out.flush() - } finally { -System.setOut(origOut) - } - - val expected = """--- -|Batch: 0 -|--- -|+-+-+ -||value|count| -|+-+-+ -||a|1| -|+-+-+ -| -|--- -|Batch: 1 -|--- -|+-+-+ -||value|c
spark git commit: [SPARK-14391][LAUNCHER] Fix launcher communication test, take 2.
Repository: spark Updated Branches: refs/heads/branch-1.6 208348595 -> 16b7f1dfc [SPARK-14391][LAUNCHER] Fix launcher communication test, take 2. There's actually a race here: the state of the handler was changed before the connection was set, so the test code could be notified of the state change, wake up, and still see the connection as null, triggering the assert. Author: Marcelo Vanzin Closes #12785 from vanzin/SPARK-14391. (cherry picked from commit 73c20bf32524c2232febc8c4b12d5fa228347163) Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/16b7f1df Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/16b7f1df Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/16b7f1df Branch: refs/heads/branch-1.6 Commit: 16b7f1dfc0570f32e23f640e063d8e7fd9115792 Parents: 2083485 Author: Marcelo Vanzin Authored: Fri Apr 29 23:13:50 2016 -0700 Committer: Marcelo Vanzin Committed: Mon Jun 20 09:55:06 2016 -0700 -- .../src/main/java/org/apache/spark/launcher/LauncherServer.java| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/16b7f1df/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java -- diff --git a/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java b/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java index 414ffc2..e493514 100644 --- a/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java +++ b/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java @@ -298,8 +298,8 @@ class LauncherServer implements Closeable { Hello hello = (Hello) msg; ChildProcAppHandle handle = pending.remove(hello.secret); if (handle != null) { -handle.setState(SparkAppHandle.State.CONNECTED); handle.setConnection(this); +handle.setState(SparkAppHandle.State.CONNECTED); this.handle = handle; } else { throw new IllegalArgumentException("Received Hello for unknown client.");
spark git commit: [SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables
Repository: spark Updated Branches: refs/heads/branch-2.0 19397caab -> 0b0b5fe54 [SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables ## What changes were proposed in this pull request? This PR adds the static partition support to INSERT statement when the target table is a data source table. ## How was this patch tested? New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite. **Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.** Author: Yin Huai Closes #13769 from yhuai/SPARK-16030-1. (cherry picked from commit 905f774b71f4b814d5a2412c7c35bd023c3dfdf8) Signed-off-by: Cheng Lian Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0b0b5fe5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0b0b5fe5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0b0b5fe5 Branch: refs/heads/branch-2.0 Commit: 0b0b5fe549086171d851d7c4458d48be9409380f Parents: 19397ca Author: Yin Huai Authored: Mon Jun 20 20:17:47 2016 +0800 Committer: Cheng Lian Committed: Mon Jun 20 20:18:17 2016 +0800 -- .../sql/catalyst/analysis/CheckAnalysis.scala | 19 ++ .../datasources/DataSourceStrategy.scala| 127 +++- .../spark/sql/execution/datasources/rules.scala | 7 - .../spark/sql/internal/SessionState.scala | 2 +- .../sql/sources/DataSourceAnalysisSuite.scala | 202 +++ .../spark/sql/hive/HiveSessionState.scala | 2 +- .../hive/execution/InsertIntoHiveTable.scala| 3 +- .../sql/hive/InsertIntoHiveTableSuite.scala | 97 - .../sql/hive/execution/HiveQuerySuite.scala | 2 +- 9 files changed, 436 insertions(+), 25 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0b0b5fe5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 7b451ba..8992276 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -313,6 +313,8 @@ trait CheckAnalysis extends PredicateHelper { |${s.catalogTable.identifier} """.stripMargin) + // TODO: We need to consolidate this kind of checks for InsertIntoTable + // with the rule of PreWriteCheck defined in extendedCheckRules. 
case InsertIntoTable(s: SimpleCatalogRelation, _, _, _, _) => failAnalysis( s""" @@ -320,6 +322,23 @@ trait CheckAnalysis extends PredicateHelper { |${s.catalogTable.identifier} """.stripMargin) + case InsertIntoTable(t, _, _, _, _) +if !t.isInstanceOf[LeafNode] || + t == OneRowRelation || + t.isInstanceOf[LocalRelation] => +failAnalysis(s"Inserting into an RDD-based table is not allowed.") + + case i @ InsertIntoTable(table, partitions, query, _, _) => +val numStaticPartitions = partitions.values.count(_.isDefined) +if (table.output.size != (query.output.size + numStaticPartitions)) { + failAnalysis( +s"$table requires that the data to be inserted have the same number of " + + s"columns as the target table: target table has ${table.output.size} " + + s"column(s) but the inserted data has " + + s"${query.output.size + numStaticPartitions} column(s), including " + + s"$numStaticPartitions partition column(s) having constant value(s).") +} + case o if !o.resolved => failAnalysis( s"unresolved operator ${operator.simpleString}") http://git-wip-us.apache.org/repos/asf/spark/blob/0b0b5fe5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index 2b47865..27133f0 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala @@ -22,15 +22,15 @@ import scala.collection.mutable.ArrayBuffer import org.apache.spark.internal.Logging
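In SQL terms, the change lets an INSERT against a partitioned data source table pin some partition columns to constants in a PARTITION clause, with the remaining columns supplied by the query. An illustrative pair of statements issued from SparkR (table and column names are invented; `events` is assumed to be a data source table partitioned by `ds`):

```r
sparkR.session()

# Static partition: `ds` is fixed to a constant, so the SELECT supplies only the data columns.
sql("INSERT INTO TABLE events PARTITION (ds = '2016-06-20') SELECT id, name FROM staging")

# Fully dynamic alternative: the partition value comes from the query's trailing column.
sql("INSERT INTO TABLE events SELECT id, name, ds FROM staging")
```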
spark git commit: [SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables
Repository: spark Updated Branches: refs/heads/master 6d0f921ae -> 905f774b7 [SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables ## What changes were proposed in this pull request? This PR adds the static partition support to INSERT statement when the target table is a data source table. ## How was this patch tested? New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite. **Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.** Author: Yin Huai Closes #13769 from yhuai/SPARK-16030-1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/905f774b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/905f774b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/905f774b Branch: refs/heads/master Commit: 905f774b71f4b814d5a2412c7c35bd023c3dfdf8 Parents: 6d0f921 Author: Yin Huai Authored: Mon Jun 20 20:17:47 2016 +0800 Committer: Cheng Lian Committed: Mon Jun 20 20:17:47 2016 +0800 -- .../sql/catalyst/analysis/CheckAnalysis.scala | 19 ++ .../datasources/DataSourceStrategy.scala| 127 +++- .../spark/sql/execution/datasources/rules.scala | 7 - .../spark/sql/internal/SessionState.scala | 2 +- .../sql/sources/DataSourceAnalysisSuite.scala | 202 +++ .../spark/sql/hive/HiveSessionState.scala | 2 +- .../hive/execution/InsertIntoHiveTable.scala| 3 +- .../sql/hive/InsertIntoHiveTableSuite.scala | 97 - .../sql/hive/execution/HiveQuerySuite.scala | 2 +- 9 files changed, 436 insertions(+), 25 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/905f774b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 7b451ba..8992276 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -313,6 +313,8 @@ trait CheckAnalysis extends PredicateHelper { |${s.catalogTable.identifier} """.stripMargin) + // TODO: We need to consolidate this kind of checks for InsertIntoTable + // with the rule of PreWriteCheck defined in extendedCheckRules. 
case InsertIntoTable(s: SimpleCatalogRelation, _, _, _, _) => failAnalysis( s""" @@ -320,6 +322,23 @@ trait CheckAnalysis extends PredicateHelper { |${s.catalogTable.identifier} """.stripMargin) + case InsertIntoTable(t, _, _, _, _) +if !t.isInstanceOf[LeafNode] || + t == OneRowRelation || + t.isInstanceOf[LocalRelation] => +failAnalysis(s"Inserting into an RDD-based table is not allowed.") + + case i @ InsertIntoTable(table, partitions, query, _, _) => +val numStaticPartitions = partitions.values.count(_.isDefined) +if (table.output.size != (query.output.size + numStaticPartitions)) { + failAnalysis( +s"$table requires that the data to be inserted have the same number of " + + s"columns as the target table: target table has ${table.output.size} " + + s"column(s) but the inserted data has " + + s"${query.output.size + numStaticPartitions} column(s), including " + + s"$numStaticPartitions partition column(s) having constant value(s).") +} + case o if !o.resolved => failAnalysis( s"unresolved operator ${operator.simpleString}") http://git-wip-us.apache.org/repos/asf/spark/blob/905f774b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index 2b47865..27133f0 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala @@ -22,15 +22,15 @@ import scala.collection.mutable.ArrayBuffer import org.apache.spark.internal.Logging import org.apache.spark.rdd.RDD import org.apache.spark.sql._ -import org.apache.spark.sql.catalyst.{Ca