spark git commit: [SPARKR][DOCS][MINOR] R programming guide to include csv data source example
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 18255a934 -> 9e3a59858


[SPARKR][DOCS][MINOR] R programming guide to include csv data source example

## What changes were proposed in this pull request?

Minor documentation update for the code example, code style, and a missed reference to "sparkR.init".

## How was this patch tested?

manual

shivaram

Author: Felix Cheung

Closes #14178 from felixcheung/rcsvprogrammingguide.

(cherry picked from commit fb2e8eeb0b1e56bea535165f7a3bec6558b3f4a3)
Signed-off-by: Shivaram Venkataraman


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9e3a5985
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9e3a5985
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9e3a5985

Branch: refs/heads/branch-2.0
Commit: 9e3a598582c747194188f8ad15b43aca03907bae
Parents: 18255a9
Author: Felix Cheung
Authored: Wed Jul 13 15:09:23 2016 -0700
Committer: Shivaram Venkataraman
Committed: Wed Jul 13 15:09:31 2016 -0700

----------------------------------------------------------------------
 R/pkg/inst/tests/testthat/test_sparkSQL.R |  2 +-
 docs/sparkr.md                            | 27 ++++++++++++++++++---------
 2 files changed, 19 insertions(+), 10 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/9e3a5985/R/pkg/inst/tests/testthat/test_sparkSQL.R
----------------------------------------------------------------------
diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R
index bd7b5f0..e26b015 100644
--- a/R/pkg/inst/tests/testthat/test_sparkSQL.R
+++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R
@@ -237,7 +237,7 @@ test_that("read csv as DataFrame", {
                    "Empty,Dummy,Placeholder")
   writeLines(mockLinesCsv, csvPath)
 
-  df2 <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.string = "Empty")
+  df2 <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "Empty")
   expect_equal(count(df2), 4)
   withoutna2 <- na.omit(df2, how = "any", cols = "year")
   expect_equal(count(withoutna2), 3)
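The one-line test fix above changes the option name from `na.string` to `na.strings`, the spelling documented for `read.df`. Below is a minimal self-contained sketch of the corrected flow against a local Spark 2.0 SparkR install; apart from the "Empty,Dummy,Placeholder" row, the mock rows, session settings, and temp-file path are invented for illustration and do not come from the commit.

library(SparkR)

# Start a local SparkSession; master and appName are illustrative.
sparkR.session(master = "local[2]", appName = "na-strings-example")

# Mock CSV input. Only the "Empty,Dummy,Placeholder" row appears in the
# diff above; the other rows are made up so the sketch runs on its own.
mockLinesCsv <- c("year,make,model,comment,blank",
                  "2012,Tesla,S,No comment,",
                  "1997,Ford,E350,Go get one now,",
                  "2015,Chevy,Volt,,",
                  "Empty,Dummy,Placeholder,,")
csvPath <- tempfile(pattern = "sparkr-csv-", fileext = ".csv")
writeLines(mockLinesCsv, csvPath)

# na.strings names the token the CSV source should read back as NA; the
# diff fixes the test to use this documented spelling.
df2 <- read.df(csvPath, "csv", header = "true", inferSchema = "true",
               na.strings = "Empty")
count(df2)                                       # 4 data rows
count(na.omit(df2, how = "any", cols = "year"))  # 3 once the NA year is dropped

sparkR.session.stop()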
http://git-wip-us.apache.org/repos/asf/spark/blob/9e3a5985/docs/sparkr.md
----------------------------------------------------------------------
diff --git a/docs/sparkr.md b/docs/sparkr.md
index b4acb23..9fda0ec 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -111,19 +111,17 @@ head(df)
 
 SparkR supports operating on a variety of data sources through the `SparkDataFrame` interface. This section describes the general methods for loading and saving data using Data Sources. You can check the Spark SQL programming guide for more [specific options](sql-programming-guide.html#manually-specifying-options) that are available for the built-in data sources.
 
 The general method for creating SparkDataFrames from data sources is `read.df`. This method takes in the path for the file to load and the type of data source, and the currently active SparkSession will be used automatically.
 SparkR supports reading JSON, CSV and Parquet files natively and through [Spark Packages](http://spark-packages.org/) you can find data source connectors for popular file formats like [Avro](http://spark-packages.org/package/databricks/spark-avro). These packages can either be added by
-specifying `--packages` with `spark-submit` or `sparkR` commands, or if creating context through `init`
-you can specify the packages with the `packages` argument.
+specifying `--packages` with `spark-submit` or `sparkR` commands, or if initializing SparkSession with `sparkPackages` parameter when in an interactive R shell or from RStudio.
 
 {% highlight r %}
-sc <- sparkR.session(sparkPackages="com.databricks:spark-avro_2.11:3.0.0")
+sc <- sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
 {% endhighlight %}
 
 We can see how to use data sources using an example JSON input file. Note that the file that is used here is _not_ a typical JSON file. Each line in the file must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
-
 {% highlight r %}
 people <- read.df("./examples/src/main/resources/people.json", "json")
 head(people)
@@ -138,6 +136,18 @@ printSchema(people)
 # |-- age: long (nullable = true)
 # |-- name: string (nullable = true)
 
+# Similarly, multiple files can be read with read.json
+people <- read.json(c("./examples/src/main/resources/people.json", "./examples/src/main/resources/people2.json"))
+
+{% endhighlight %}
+
+
+The data sources API natively supports CSV formatted input files. For more information please refer to SparkR [read.df](api/R/read.df.html) API documentation.
+
+
+{% highlight r %}
+df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA")
+
 {% endhighlight %}
@@ -146,7 +156,7 @@ to a Parquet file using `write.df`.
 
 {% highlight r %}
-write.df(people, path="people.parquet", source="parquet
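Taken together, the updated guide text amounts to the short end-to-end session sketched below. This is an illustration rather than a verbatim guide excerpt: the Avro package coordinate and the people.json path come from the diff, people2.json is the guide's own example of a second input file, and the CSV path is a stand-in for any header-bearing CSV on disk.

library(SparkR)

# sparkPackages supersedes the packages argument of the deprecated
# sparkR.init(); the Avro coordinate is the one shown in the updated guide.
sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")

# Each line of people.json is one self-contained JSON object, as the guide
# requires; the relative path assumes R was started from SPARK_HOME.
people <- read.df("./examples/src/main/resources/people.json", "json")
printSchema(people)

# read.json accepts a character vector to load several files in one call;
# people2.json is the second file named in the guide's example.
people <- read.json(c("./examples/src/main/resources/people.json",
                      "./examples/src/main/resources/people2.json"))

# CSV is handled natively through the same read.df entry point.
csvPath <- "path/to/your.csv"  # placeholder; substitute a real CSV file
df <- read.df(csvPath, "csv", header = "true", inferSchema = "true",
              na.strings = "NA")
head(df)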