spark git commit: [SPARK-15672][R][DOC] R programming guide update

2016-06-22 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 e043c02d0 -> 299f427b7


[SPARK-15672][R][DOC] R programming guide update

## What changes were proposed in this pull request?
Guide for
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions

## How was this patch tested?
build locally
https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png

Author: Kai Jiang 

Closes #13660 from vectorijk/spark-15672-R-guide-update.

(cherry picked from commit 43b04b7ecb313a2cee6121dd575de1f7dc785c11)
Signed-off-by: Joseph K. Bradley 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/299f427b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/299f427b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/299f427b

Branch: refs/heads/branch-2.0
Commit: 299f427b70f8dedbc0b554f83c4fde408caf4d15
Parents: e043c02
Author: Kai Jiang 
Authored: Wed Jun 22 12:50:36 2016 -0700
Committer: Joseph K. Bradley 
Committed: Wed Jun 22 12:50:44 2016 -0700

--
 R/pkg/R/context.R |  2 +-
 docs/sparkr.md| 77 ++
 2 files changed, 78 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/299f427b/R/pkg/R/context.R
--
diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R
index 96ef943..dd0ceae 100644
--- a/R/pkg/R/context.R
+++ b/R/pkg/R/context.R
@@ -246,7 +246,7 @@ setCheckpointDir <- function(sc, dirName) {
 #'   \preformatted{
 #' train <- function(hyperparam) {
 #'   library(MASS)
-#'   lm.ridge(“y ~ x+z”, data, lambda=hyperparam)
+#'   lm.ridge("y ~ x+z", data, lambda=hyperparam)
 #'   model
 #' }
 #'   }
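
For context, the roxygen example patched above defines a `train` function that is meant to be fanned out with `spark.lapply`. A minimal, hypothetical invocation sketch (not part of this commit; it assumes an active SparkR session, the two-argument `spark.lapply(list, func)` form, that MASS is installed on the workers, and uses R's built-in `longley` data in place of the unspecified `data`):

    library(SparkR)
    sparkR.session()                        # assumed setup; not shown in the patch

    train <- function(hyperparam) {
      library(MASS)
      # `longley` stands in for the `data` placeholder used in the roxygen example
      lm.ridge(Employed ~ GNP + Population, longley, lambda = hyperparam)
    }

    lambdas <- c(0.01, 0.1, 1, 10)          # hypothetical ridge penalties to try
    models <- spark.lapply(lambdas, train)  # one fitted ridge model per lambda, returned as a local list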

http://git-wip-us.apache.org/repos/asf/spark/blob/299f427b/docs/sparkr.md
--
diff --git a/docs/sparkr.md b/docs/sparkr.md
index f018901..9e74e4a 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -255,6 +255,83 @@ head(df)
 {% endhighlight %}
 
 
+### Applying User-Defined Functions
+In SparkR, we support several kinds of User-Defined Functions:
+
+#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
+
+##### dapply
+Apply a function to each partition of a `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
+should have only one parameter, to which a `data.frame` corresponding to each partition will be passed. The output of the function
+should be a `data.frame`. The schema specifies the row format of the resulting `SparkDataFrame`, and it must match the R function's output.
+
+{% highlight r %}
+
+# Convert waiting time from minutes to seconds.
+# Note that we can apply a UDF to a DataFrame.
+schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
+                     structField("waiting_secs", "double"))
+df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema)
+head(collect(df1))
+##  eruptions waiting waiting_secs
+##1     3.600      79         4740
+##2     1.800      54         3240
+##3     3.333      74         4440
+##4     2.283      62         3720
+##5     4.533      85         5100
+##6     2.883      55         3300
+{% endhighlight %}
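
The `df` used in this example is not constructed in the hunk shown here; presumably it comes from the guide's earlier examples. A minimal setup sketch (not part of the patch), assuming R's built-in `faithful` dataset (columns `eruptions` and `waiting`) and an active SparkR session:

{% highlight r %}
library(SparkR)
sparkR.session()                 # assumed: an active SparkR session
# Build a SparkDataFrame from R's built-in Old Faithful dataset
df <- createDataFrame(faithful)
{% endhighlight %}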
+
+
+##### dapplyCollect
+Like `dapply`, apply a function to each partition of a `SparkDataFrame` and collect the result back. The output of the function
+should be a `data.frame`, but the schema is not required to be passed. Note that `dapplyCollect` can only be used if the
+output of the UDF run on all the partitions can fit in driver memory.
+
+{% highlight r %}
+
+# Convert waiting time from minutes to seconds.
+# Note that we can apply a UDF to a DataFrame and return an R data.frame
+ldf <- dapplyCollect(
+ df,
+ function(x) {
+   x <- cbind(x, "waiting_secs" = x$waiting * 60)
+ })
+head(ldf, 3)
+##  eruptions waiting waiting_secs
+##1     3.600      79         4740
+##2     1.800      54         3240
+##3     3.333      74         4440
+
+{% endhighlight %}
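
For comparison (not part of the patch), the same local result can be obtained by combining `dapply` with `collect`, at the cost of passing the schema explicitly; a sketch reusing the `schema` defined in the `dapply` example above:

{% highlight r %}
# Transform each partition with dapply (schema required), then pull the
# distributed result back to the driver with collect().
df2 <- dapply(df, function(x) { cbind(x, waiting_secs = x$waiting * 60) }, schema)
ldf2 <- collect(df2)
head(ldf2, 3)
{% endhighlight %}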
+
+
+#### Run local R functions distributed using `spark.lapply`
+
+##### spark.lapply
+Similar to `lapply` in native R, `spark.lapply` runs a function over a list of elements and distributes the computations with Spark.
+It applies a function to the elements of a list in a manner similar to `doParallel` or `lapply`. The results of all the computations
+should fit on a single machine. If that is not the case, you can do something like `df <- createDataFrame(list)` and then use
+`dapply`.
+
+{% highlight r %}
+
+# Perform distributed training of multiple models with spark.lapply. Here, we pass
+# a read-only list of arguments which specifies the family the generalized linear model should use.

spark git commit: [SPARK-15672][R][DOC] R programming guide update

2016-06-22 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 6f915c9ec -> 43b04b7ec


[SPARK-15672][R][DOC] R programming guide update

## What changes were proposed in this pull request?
Guide for
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions

## How was this patch tested?
build locally
https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png

Author: Kai Jiang 

Closes #13660 from vectorijk/spark-15672-R-guide-update.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/43b04b7e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/43b04b7e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/43b04b7e

Branch: refs/heads/master
Commit: 43b04b7ecb313a2cee6121dd575de1f7dc785c11
Parents: 6f915c9
Author: Kai Jiang 
Authored: Wed Jun 22 12:50:36 2016 -0700
Committer: Joseph K. Bradley 
Committed: Wed Jun 22 12:50:36 2016 -0700

--
 R/pkg/R/context.R |  2 +-
 docs/sparkr.md| 77 ++
 2 files changed, 78 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/43b04b7e/R/pkg/R/context.R
--
diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R
index 96ef943..dd0ceae 100644
--- a/R/pkg/R/context.R
+++ b/R/pkg/R/context.R
@@ -246,7 +246,7 @@ setCheckpointDir <- function(sc, dirName) {
 #'   \preformatted{
 #' train <- function(hyperparam) {
 #'   library(MASS)
-#'   lm.ridge(“y ~ x+z”, data, lambda=hyperparam)
+#'   lm.ridge("y ~ x+z", data, lambda=hyperparam)
 #'   model
 #' }
 #'   }

http://git-wip-us.apache.org/repos/asf/spark/blob/43b04b7e/docs/sparkr.md
--
diff --git a/docs/sparkr.md b/docs/sparkr.md
index f018901..9e74e4a 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -255,6 +255,83 @@ head(df)
 {% endhighlight %}
 
 
+### Applying User-Defined Functions
+In SparkR, we support several kinds of User-Defined Functions:
+
+#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
+
+##### dapply
+Apply a function to each partition of a `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
+should have only one parameter, to which a `data.frame` corresponding to each partition will be passed. The output of the function
+should be a `data.frame`. The schema specifies the row format of the resulting `SparkDataFrame`, and it must match the R function's output.
+
+{% highlight r %}
+
+# Convert waiting time from minutes to seconds.
+# Note that we can apply a UDF to a DataFrame.
+schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
+                     structField("waiting_secs", "double"))
+df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema)
+head(collect(df1))
+##  eruptions waiting waiting_secs
+##1     3.600      79         4740
+##2     1.800      54         3240
+##3     3.333      74         4440
+##4     2.283      62         3720
+##5     4.533      85         5100
+##6     2.883      55         3300
+{% endhighlight %}
+
+
+##### dapplyCollect
+Like `dapply`, apply a function to each partition of a `SparkDataFrame` and collect the result back. The output of the function
+should be a `data.frame`, but the schema is not required to be passed. Note that `dapplyCollect` can only be used if the
+output of the UDF run on all the partitions can fit in driver memory.
+
+{% highlight r %}
+
+# Convert waiting time from minutes to seconds.
+# Note that we can apply a UDF to a DataFrame and return an R data.frame
+ldf <- dapplyCollect(
+ df,
+ function(x) {
+   x <- cbind(x, "waiting_secs" = x$waiting * 60)
+ })
+head(ldf, 3)
+##  eruptions waiting waiting_secs
+##1     3.600      79         4740
+##2     1.800      54         3240
+##3     3.333      74         4440
+
+{% endhighlight %}
+
+
+#### Run local R functions distributed using `spark.lapply`
+
+##### spark.lapply
+Similar to `lapply` in native R, `spark.lapply` runs a function over a list of elements and distributes the computations with Spark.
+It applies a function to the elements of a list in a manner similar to `doParallel` or `lapply`. The results of all the computations
+should fit on a single machine. If that is not the case, you can do something like `df <- createDataFrame(list)` and then use
+`dapply`.
+
+{% highlight r %}
+
+# Perform distributed training of multiple models with spark.lapply. Here, we pass
+# a read-only list of arguments which specifies the family the generalized linear model should use.
+families <- c("gaussian", "poisson")
+train <- function(family) {
+  model <- glm(Sepal.Length ~ Sepal.W