Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14980#discussion_r77734065
  
    --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
    @@ -0,0 +1,853 @@
    +---
    +title: "SparkR - Practical Guide"
    +output:
    +  html_document:
    +    theme: united
    +    toc: true
    +    toc_depth: 4
    +    toc_float: true
    +    highlight: textmate
    +---
    +
    +## Overview
    +
    +SparkR is an R package that provides a lightweight frontend to use Apache Spark from R. In Spark 2.0.0, SparkR provides a distributed data frame implementation that supports data processing operations such as selection, filtering, and aggregation, as well as distributed machine learning using [MLlib](http://spark.apache.org/mllib/).
    +
    +## Getting Started
    +
    +We begin with an example running on the local machine that provides an overview of using SparkR: data ingestion, data processing, and machine learning.
    +
    +First, let's load and attach the package.
    +```{r, message=FALSE}
    +library(SparkR)
    +```
    +
    +`SparkSession` is the entry point into SparkR, which connects your R program to a Spark cluster. You can create a `SparkSession` using `sparkR.session` and pass in options such as the application name, any Spark packages the application depends on, and so on.
    +
    +We use default settings, under which SparkR runs in local mode. It automatically downloads the Spark package in the background if no previous installation is found. For more details on setup, see [Spark Session](#SetupSparkSession).
    +
    +```{r, message=FALSE, warning=FALSE}
    +sparkR.session()
    +```
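    +
    +For example, one could pass an application name or additional Spark packages when creating the session (the package coordinate below is only an illustration):
    +
    +```{r, eval=FALSE}
    +# Not evaluated here; illustrates passing options such as appName and sparkPackages.
    +sparkR.session(appName = "SparkR-vignettes",
    +               sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
    +```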
    +
    +The operations in SparkR are centered around an R class called 
`SparkDataFrame`. It is a distributed collection of data organized into named 
columns, which is conceptually equivalent to a table in a relational database 
or a data frame in R, but with richer optimizations under the hood.
    +
    +A `SparkDataFrame` can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing local R data frames. For example, we can create a `SparkDataFrame` from a local R data frame:
    +
    +```{r}
    +cars <- cbind(model = rownames(mtcars), mtcars)
    +carsDF <- createDataFrame(cars)
    +```
    +
    +We can view the first few rows of the `SparkDataFrame` with the `showDF` or `head` function.
    +```{r}
    +showDF(carsDF)
    +```
    +
    +Common data processing operations such as `filter` and `select` are supported on the `SparkDataFrame`.
    +```{r}
    +carsSubDF <- select(carsDF, "model", "mpg", "hp")
    +carsSubDF <- filter(carsSubDF, carsSubDF$hp >= 200)
    +showDF(carsSubDF)
    +```
    +
    +SparkR can use many common aggregation functions after grouping.
    +
    +```{r}
    +carsGPDF <- summarize(groupBy(carsDF, carsDF$gear), count = n(carsDF$gear))
    +showDF(carsGPDF)
    +```
    +
    +The results `carsDF` and `carsSubDF` are `SparkDataFrame` objects. To convert back to an R `data.frame`, we can use `collect`.
    --- End diff ---
    
    should we add a note here cautioning against calling `collect` on a large distributed dataset?
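    
    e.g. something along these lines (just a sketch, reusing `carsSubDF` from the diff above; exact wording up to you):
    
    ```r
    # `collect` brings the entire distributed dataset back to the driver as a local
    # data.frame, so only use it on data small enough to fit in driver memory;
    # use `head` or `take` to inspect a sample of a large SparkDataFrame instead.
    carsLocal <- collect(carsSubDF)
    class(carsLocal)
    ```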

