Repository: spark
Updated Branches:
  refs/heads/master b4baf086c -> fb2e8eeb0


[SPARKR][DOCS][MINOR] R programming guide to include csv data source example

## What changes were proposed in this pull request?

Minor documentation update for the code example, code style, and a missed reference to `sparkR.init`.

## How was this patch tested?

manual

shivaram

Author: Felix Cheung <felixcheun...@hotmail.com>

Closes #14178 from felixcheung/rcsvprogrammingguide.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fb2e8eeb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fb2e8eeb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fb2e8eeb

Branch: refs/heads/master
Commit: fb2e8eeb0b1e56bea535165f7a3bec6558b3f4a3
Parents: b4baf08
Author: Felix Cheung <felixcheun...@hotmail.com>
Authored: Wed Jul 13 15:09:23 2016 -0700
Committer: Shivaram Venkataraman <shiva...@cs.berkeley.edu>
Committed: Wed Jul 13 15:09:23 2016 -0700

----------------------------------------------------------------------
 R/pkg/inst/tests/testthat/test_sparkSQL.R |  2 +-
 docs/sparkr.md                            | 27 +++++++++++++++++---------
 2 files changed, 19 insertions(+), 10 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/fb2e8eeb/R/pkg/inst/tests/testthat/test_sparkSQL.R
----------------------------------------------------------------------
diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R
index fdd6020..e61fa41 100644
--- a/R/pkg/inst/tests/testthat/test_sparkSQL.R
+++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R
@@ -237,7 +237,7 @@ test_that("read csv as DataFrame", {
                    "Empty,Dummy,Placeholder")
   writeLines(mockLinesCsv, csvPath)
 
-  df2 <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.string = "Empty")
+  df2 <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "Empty")
   expect_equal(count(df2), 4)
   withoutna2 <- na.omit(df2, how = "any", cols = "year")
   expect_equal(count(withoutna2), 3)
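
For reference, a minimal standalone sketch of the corrected reader call; the SparkR session and the temporary CSV written below are illustrative assumptions, not taken from the test fixture:

{% highlight r %}
library(SparkR)
sparkR.session()

# Write a small CSV with an "Empty" placeholder value to a temporary file.
csvPath <- tempfile(pattern = "cars", fileext = ".csv")
writeLines(c("year,make,model",
             "2012,Tesla,S",
             "Empty,Dummy,Placeholder"),
           csvPath)

# The reader option is na.strings (plural); matching cells become NA.
df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "Empty")
head(df)
{% endhighlight %}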

http://git-wip-us.apache.org/repos/asf/spark/blob/fb2e8eeb/docs/sparkr.md
----------------------------------------------------------------------
diff --git a/docs/sparkr.md b/docs/sparkr.md
index b4acb23..9fda0ec 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -111,19 +111,17 @@ head(df)
 SparkR supports operating on a variety of data sources through the `SparkDataFrame` interface. This section describes the general methods for loading and saving data using Data Sources. You can check the Spark SQL programming guide for more [specific options](sql-programming-guide.html#manually-specifying-options) that are available for the built-in data sources.
 
 The general method for creating SparkDataFrames from data sources is `read.df`. This method takes in the path for the file to load and the type of data source, and the currently active SparkSession will be used automatically. SparkR supports reading JSON, CSV and Parquet files natively and through [Spark Packages](http://spark-packages.org/) you can find data source connectors for popular file formats like [Avro](http://spark-packages.org/package/databricks/spark-avro). These packages can either be added by
-specifying `--packages` with `spark-submit` or `sparkR` commands, or if creating context through `init`
-you can specify the packages with the `packages` argument.
+specifying `--packages` with `spark-submit` or `sparkR` commands, or with the `sparkPackages` parameter when initializing SparkSession in an interactive R shell or from RStudio.
 
 <div data-lang="r" markdown="1">
 {% highlight r %}
-sc <- sparkR.session(sparkPackages="com.databricks:spark-avro_2.11:3.0.0")
+sc <- sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
 {% endhighlight %}
 </div>
 
 We can see how to use data sources using an example JSON input file. Note that the file that is used here is _not_ a typical JSON file. Each line in the file must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
 
 <div data-lang="r"  markdown="1">
-
 {% highlight r %}
 people <- read.df("./examples/src/main/resources/people.json", "json")
 head(people)
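
As a sketch of the `sparkPackages` route described above (assuming the spark-avro package has been loaded and a hypothetical Avro file `./users.avro` exists), a third-party source is read through the same `read.df` entry point:

{% highlight r %}
# Assumes the session was started with
# sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
usersDF <- read.df("./users.avro", "com.databricks.spark.avro")
head(usersDF)

# Writing goes through write.df with the same source name.
write.df(usersDF, path = "users_copy.avro", source = "com.databricks.spark.avro", mode = "overwrite")
{% endhighlight %}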
@@ -138,6 +136,18 @@ printSchema(people)
 #  |-- age: long (nullable = true)
 #  |-- name: string (nullable = true)
 
+# Similarly, multiple files can be read with read.json
+people <- read.json(c("./examples/src/main/resources/people.json", "./examples/src/main/resources/people2.json"))
+
+{% endhighlight %}
+</div>
+
+The data sources API natively supports CSV-formatted input files. For more information, please refer to the SparkR [read.df](api/R/read.df.html) API documentation.
+
+<div data-lang="r"  markdown="1">
+{% highlight r %}
+df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA")
+
 {% endhighlight %}
 </div>
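
As an alternative sketch, `read.df` also accepts an explicit `schema` instead of `inferSchema`; the column names and types below are assumptions about the hypothetical `csvPath` file:

{% highlight r %}
csvSchema <- structType(structField("year", "integer"),
                        structField("make", "string"),
                        structField("model", "string"))
df <- read.df(csvPath, "csv", schema = csvSchema, header = "true", na.strings = "NA")
printSchema(df)
{% endhighlight %}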
 
@@ -146,7 +156,7 @@ to a Parquet file using `write.df`.
 
 <div data-lang="r"  markdown="1">
 {% highlight r %}
-write.df(people, path="people.parquet", source="parquet", mode="overwrite")
+write.df(people, path = "people.parquet", source = "parquet", mode = "overwrite")
 {% endhighlight %}
 </div>
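
A small follow-up sketch, assuming the `people.parquet` output written above: the Parquet data can be loaded back with `read.df`, or with the `read.parquet` shorthand:

{% highlight r %}
peopleParquet <- read.df("people.parquet", "parquet")
# Equivalently, with the format-specific reader:
peopleParquet <- read.parquet("people.parquet")
head(peopleParquet)
{% endhighlight %}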
 
@@ -264,14 +274,14 @@ In SparkR, we support several kinds of User-Defined Functions:
 Apply a function to each partition of a `SparkDataFrame`. The function to be applied to each partition should have only one parameter, to which a `data.frame` corresponding to that partition will be passed. The output of the function should be a `data.frame`. The schema specifies the row format of the resulting `SparkDataFrame`; it must match the R function's output.
+
 <div data-lang="r"  markdown="1">
 {% highlight r %}
-
 # Convert waiting time from hours to seconds.
 # Note that we can apply UDF to DataFrame.
 schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
                      structField("waiting_secs", "double"))
-df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema)
+df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)
 head(collect(df1))
 ##  eruptions waiting waiting_secs
 ##1     3.600      79         4740
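
A closely related sketch, assuming `df` is the `SparkDataFrame` built from the `faithful` dataset as earlier in the guide: `dapplyCollect` applies the same kind of per-partition function but collects the result to a local `data.frame`, so no schema is required:

{% highlight r %}
df <- as.DataFrame(faithful)

# Like dapply, but the result is collected back as a local R data.frame,
# so no output schema needs to be supplied.
ldf <- dapplyCollect(df, function(x) {
  cbind(x, waiting_secs = x$waiting * 60)
})
head(ldf)
{% endhighlight %}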
@@ -313,9 +323,9 @@ Similar to `lapply` in native R, `spark.lapply` runs a function over a list of elements
 Applies a function in a manner that is similar to `doParallel` or `lapply` to elements of a list. The results of all the computations should fit on a single machine. If that is not the case, you can do something like `df <- createDataFrame(list)` and then use `dapply`.
+
 <div data-lang="r"  markdown="1">
 {% highlight r %}
-
 # Perform distributed training of multiple models with spark.lapply. Here, we pass
 # a read-only list of arguments which specifies the family the generalized linear model should use.
 families <- c("gaussian", "poisson")
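
A sketch of how such a list of families might then be consumed by `spark.lapply`; the training function and the use of the `iris` dataset are illustrative assumptions:

{% highlight r %}
# Each task fits one generalized linear model with base R's glm on the iris data.
train <- function(family) {
  model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
  summary(model)
}

# Distribute the computations; the results come back as a local list.
model.summaries <- spark.lapply(families, train)
print(model.summaries)
{% endhighlight %}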
@@ -436,4 +446,3 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma
 - The method `registerTempTable` has been deprecated to be replaced by `createOrReplaceTempView`.
 - The method `dropTempTable` has been deprecated to be replaced by `dropTempView`.
 - The `sc` SparkContext parameter is no longer required for these functions: `setJobGroup`, `clearJobGroup`, `cancelJobGroup`
- 
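
A minimal sketch of the replacement APIs named in the migration notes above; the view name and query are illustrative:

{% highlight r %}
df <- as.DataFrame(faithful)

# createOrReplaceTempView supersedes the deprecated registerTempTable.
createOrReplaceTempView(df, "faithful_view")
longWaits <- sql("SELECT * FROM faithful_view WHERE waiting > 80")
head(longWaits)

# dropTempView supersedes the deprecated dropTempTable.
dropTempView("faithful_view")
{% endhighlight %}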


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
