spark git commit: [SPARKR][DOCS][MINOR] R programming guide to include csv data source example

2016-07-13 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 18255a934 -> 9e3a59858


[SPARKR][DOCS][MINOR] R programming guide to include csv data source example

## What changes were proposed in this pull request?

Minor documentation update for the code example, code style, and a missed reference to `sparkR.init`.

## How was this patch tested?

manual

shivaram

Author: Felix Cheung 

Closes #14178 from felixcheung/rcsvprogrammingguide.

(cherry picked from commit fb2e8eeb0b1e56bea535165f7a3bec6558b3f4a3)
Signed-off-by: Shivaram Venkataraman 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9e3a5985
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9e3a5985
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9e3a5985

Branch: refs/heads/branch-2.0
Commit: 9e3a598582c747194188f8ad15b43aca03907bae
Parents: 18255a9
Author: Felix Cheung 
Authored: Wed Jul 13 15:09:23 2016 -0700
Committer: Shivaram Venkataraman 
Committed: Wed Jul 13 15:09:31 2016 -0700

--
 R/pkg/inst/tests/testthat/test_sparkSQL.R |  2 +-
 docs/sparkr.md                            | 27 ++++++++++++++++++---------
 2 files changed, 19 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9e3a5985/R/pkg/inst/tests/testthat/test_sparkSQL.R
--
diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R
index bd7b5f0..e26b015 100644
--- a/R/pkg/inst/tests/testthat/test_sparkSQL.R
+++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R
@@ -237,7 +237,7 @@ test_that("read csv as DataFrame", {
"Empty,Dummy,Placeholder")
   writeLines(mockLinesCsv, csvPath)
 
-  df2 <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.string = "Empty")
+  df2 <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "Empty")
   expect_equal(count(df2), 4)
   withoutna2 <- na.omit(df2, how = "any", cols = "year")
   expect_equal(count(withoutna2), 3)

http://git-wip-us.apache.org/repos/asf/spark/blob/9e3a5985/docs/sparkr.md
--
diff --git a/docs/sparkr.md b/docs/sparkr.md
index b4acb23..9fda0ec 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -111,19 +111,17 @@ head(df)
 SparkR supports operating on a variety of data sources through the 
`SparkDataFrame` interface. This section describes the general methods for 
loading and saving data using Data Sources. You can check the Spark SQL 
programming guide for more [specific 
options](sql-programming-guide.html#manually-specifying-options) that are 
available for the built-in data sources.
 
 The general method for creating SparkDataFrames from data sources is 
`read.df`. This method takes in the path for the file to load and the type of 
data source, and the currently active SparkSession will be used automatically. 
SparkR supports reading JSON, CSV and Parquet files natively and through [Spark 
Packages](http://spark-packages.org/) you can find data source connectors for 
popular file formats like 
[Avro](http://spark-packages.org/package/databricks/spark-avro). These packages 
can either be added by
-specifying `--packages` with `spark-submit` or `sparkR` commands, or if creating context through `init`
-you can specify the packages with the `packages` argument.
+specifying `--packages` with `spark-submit` or `sparkR` commands, or if initializing SparkSession with `sparkPackages` parameter when in an interactive R shell or from RStudio.
 
 
 {% highlight r %}
-sc <- sparkR.session(sparkPackages="com.databricks:spark-avro_2.11:3.0.0")
+sc <- sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
 {% endhighlight %}
 
 
 We can see how to use data sources using an example JSON input file. Note that 
the file that is used here is _not_ a typical JSON file. Each line in the file 
must contain a separate, self-contained valid JSON object. As a consequence, a 
regular multi-line JSON file will most often fail.
 
 
-
 {% highlight r %}
 people <- read.df("./examples/src/main/resources/people.json", "json")
 head(people)
@@ -138,6 +136,18 @@ printSchema(people)
 #  |-- age: long (nullable = true)
 #  |-- name: string (nullable = true)
 
+# Similarly, multiple files can be read with read.json
+people <- read.json(c("./examples/src/main/resources/people.json", "./examples/src/main/resources/people2.json"))
+
+{% endhighlight %}
+
+
+The data sources API natively supports CSV formatted input files. For more information please refer to SparkR [read.df](api/R/read.df.html) API documentation.
+
+
+{% highlight r %}
+df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA")
+
 {% endhighlight %}
 
 
@@ -146,7 +156,7 @@

spark git commit: [SPARKR][DOCS][MINOR] R programming guide to include csv data source example

2016-07-13 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master b4baf086c -> fb2e8eeb0


[SPARKR][DOCS][MINOR] R programming guide to include csv data source example

## What changes were proposed in this pull request?

Minor documentation update for the code example, code style, and a missed reference to `sparkR.init`.

## How was this patch tested?

manual

shivaram

Author: Felix Cheung 

Closes #14178 from felixcheung/rcsvprogrammingguide.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fb2e8eeb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fb2e8eeb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fb2e8eeb

Branch: refs/heads/master
Commit: fb2e8eeb0b1e56bea535165f7a3bec6558b3f4a3
Parents: b4baf08
Author: Felix Cheung 
Authored: Wed Jul 13 15:09:23 2016 -0700
Committer: Shivaram Venkataraman 
Committed: Wed Jul 13 15:09:23 2016 -0700

--
 R/pkg/inst/tests/testthat/test_sparkSQL.R |  2 +-
 docs/sparkr.md                            | 27 ++++++++++++++++++---------
 2 files changed, 19 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fb2e8eeb/R/pkg/inst/tests/testthat/test_sparkSQL.R
--
diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R
index fdd6020..e61fa41 100644
--- a/R/pkg/inst/tests/testthat/test_sparkSQL.R
+++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R
@@ -237,7 +237,7 @@ test_that("read csv as DataFrame", {
"Empty,Dummy,Placeholder")
   writeLines(mockLinesCsv, csvPath)
 
-  df2 <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.string = "Empty")
+  df2 <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "Empty")
   expect_equal(count(df2), 4)
   withoutna2 <- na.omit(df2, how = "any", cols = "year")
   expect_equal(count(withoutna2), 3)

http://git-wip-us.apache.org/repos/asf/spark/blob/fb2e8eeb/docs/sparkr.md
--
diff --git a/docs/sparkr.md b/docs/sparkr.md
index b4acb23..9fda0ec 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -111,19 +111,17 @@ head(df)
 SparkR supports operating on a variety of data sources through the 
`SparkDataFrame` interface. This section describes the general methods for 
loading and saving data using Data Sources. You can check the Spark SQL 
programming guide for more [specific 
options](sql-programming-guide.html#manually-specifying-options) that are 
available for the built-in data sources.
 
 The general method for creating SparkDataFrames from data sources is 
`read.df`. This method takes in the path for the file to load and the type of 
data source, and the currently active SparkSession will be used automatically. 
SparkR supports reading JSON, CSV and Parquet files natively and through [Spark 
Packages](http://spark-packages.org/) you can find data source connectors for 
popular file formats like 
[Avro](http://spark-packages.org/package/databricks/spark-avro). These packages 
can either be added by
-specifying `--packages` with `spark-submit` or `sparkR` commands, or if creating context through `init`
-you can specify the packages with the `packages` argument.
+specifying `--packages` with `spark-submit` or `sparkR` commands, or if initializing SparkSession with `sparkPackages` parameter when in an interactive R shell or from RStudio.
 
 
 {% highlight r %}
-sc <- sparkR.session(sparkPackages="com.databricks:spark-avro_2.11:3.0.0")
+sc <- sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
 {% endhighlight %}
 
 
 We can see how to use data sources using an example JSON input file. Note that 
the file that is used here is _not_ a typical JSON file. Each line in the file 
must contain a separate, self-contained valid JSON object. As a consequence, a 
regular multi-line JSON file will most often fail.
 
 
-
 {% highlight r %}
 people <- read.df("./examples/src/main/resources/people.json", "json")
 head(people)
@@ -138,6 +136,18 @@ printSchema(people)
 #  |-- age: long (nullable = true)
 #  |-- name: string (nullable = true)
 
+# Similarly, multiple files can be read with read.json
+people <- read.json(c("./examples/src/main/resources/people.json", "./examples/src/main/resources/people2.json"))
+
+{% endhighlight %}
+
+
+The data sources API natively supports CSV formatted input files. For more information please refer to SparkR [read.df](api/R/read.df.html) API documentation.
+
+
+{% highlight r %}
+df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA")
+
 {% endhighlight %}
 
 
@@ -146,7 +156,7 @@ to a Parquet file using `write.df`.
 
 
 {% highlight r %}
-write.df(people, path="people.parquet", source="parquet