[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14090





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r71047599
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,135 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
+and Spark.
+
+ Data type mapping between R and Spark
+
+R          Spark
+byte       byte
+integer    integer
+float      float
+double     double
+numeric    double
+character  string
+string     string
+binary     binary
+raw        binary
+logical    boolean
+[POSIXct](https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html)
--- End diff --

I think we need to put `` in ``, e.g.
https://github.com/apache/spark/blame/master/docs/structured-streaming-programming-guide.md#L811





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-15 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r71041878
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,135 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
--- End diff --

`Below data type` -> `Below is the data type`





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-15 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r71041809
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,135 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
+and Spark.
+
+ Data type mapping between R and Spark
+
+R          Spark
+byte       byte
+integer    integer
+float      float
+double     double
+numeric    double
+character  string
+string     string
+binary     binary
+raw        binary
+logical    boolean
+[POSIXct](https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html)
--- End diff --

Also not sure why, but the URL formatting doesn't seem to be working here. A screenshot of what I see is below.

![screenshot 2016-07-15 14 13 56](https://cloud.githubusercontent.com/assets/143893/16888670/61fede2a-4a96-11e6-8b7f-507f3eb194d4.png)






[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-15 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r71041580
  
--- Diff: docs/sparkr.md ---
@@ -295,8 +294,7 @@ head(collect(df1))
 
 # dapplyCollect
 Like `dapply`, apply a function to each partition of a `SparkDataFrame` 
and collect the result back. The output of function
-should be a `data.frame`. But, Schema is not required to be passed. Note 
that `dapplyCollect` only can be used if the
-output of UDF run on all the partitions can fit in driver memory.
+should be a `data.frame`. But, Schema is not required to be passed. Note 
that `dapplyCollect` can fail if the output of UDF run on all the partition 
cannot be pulled to the driver and fit in driver memory.
 
--- End diff --

I think we need a new line before the `<div>`? Right now the `div` markings show up in the generated doc. I've attached a screenshot.

![screenshot 2016-07-15 14 11 39](https://cloud.githubusercontent.com/assets/143893/16888609/1d4409fe-4a96-11e6-97db-6ebf05a03774.png)
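
For context, the quoted passage documents `dapplyCollect`; a minimal sketch of the call it describes (assuming `df` is a `SparkDataFrame` created from the `faithful` dataset, as elsewhere in the guide):

```r
# Apply a function to each partition and collect the result back as a
# local R data.frame. No schema is required, but the collected output
# must fit in driver memory.
ldf <- dapplyCollect(
         df,
         function(x) {
           x <- cbind(x, "waiting_secs" = x$waiting * 60)
         })
head(ldf, 3)
```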






[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-14 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70926563
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
+and Spark.
+
+ Data type mapping between R and Spark
+
+R          Spark
+byte       byte
+integer    integer
+float      float
+double     double
+numeric    double
+character  string
+string     string
+binary     binary
+raw        binary
+logical    boolean
+timestamp  timestamp
+date       date
+array      array
+list       array
+map        map
+env        map
+struct
--- End diff --

And `environment` instead of `env`?
https://stat.ethz.ch/R-manual/R-devel/library/base/html/environment.html
```
> e <- new.env()
> class(e)
[1] "environment"
```





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-14 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70926341
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
+and Spark.
+
+ Data type mapping between R and Spark
+
+R          Spark
+byte       byte
+integer    integer
+float      float
+double     double
+numeric    double
+character  string
+string     string
+binary     binary
+raw        binary
+logical    boolean
+timestamp  timestamp
+date       date
+array      array
+list       array
+map        map
+env        map
+struct
--- End diff --

Yes, it should be `Date`, not `date`.





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-14 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70923795
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
+and Spark.
+
+ Data type mapping between R and Spark
+
+R          Spark
+byte       byte
+integer    integer
+float      float
+double     double
+numeric    double
+character  string
+string     string
+binary     binary
+raw        binary
+logical    boolean
+timestamp  timestamp
+date       date
+array      array
+list       array
+map        map
+env        map
+struct
--- End diff --

Not really - as I mentioned, `getSQLDataType` looks at the schema; the method that looks at the R objects is in
https://github.com/apache/spark/blob/2e4075e2ece9574100c79558cab054485e25c2ee/R/pkg/R/serialize.R#L84
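
Roughly, the serializer there dispatches on the R object's class rather than on a schema string; a simplified sketch of that idea (not the actual SparkR code):

```r
# Simplified sketch, not the actual SparkR code: the serializer picks a
# wire type from class(object), so only classes handled here are
# reachable from a UDF's returned data.frame.
writeType <- function(object) {
  switch(class(object)[[1]],
         integer   = "integer",
         numeric   = "double",
         character = "string",
         logical   = "boolean",
         raw       = "binary",
         list      = "array",
         stop("Unsupported R class: ", class(object)[[1]]))
}
writeType(1L)    # "integer"
writeType("a")   # "string"
```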





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-14 Thread NarineK
Github user NarineK commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70923645
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
+and Spark.
+
+ Data type mapping between R and Spark
+
+R          Spark
+byte       byte
+integer    integer
+float      float
+double     double
+numeric    double
+character  string
+string     string
+binary     binary
+raw        binary
+logical    boolean
+timestamp  timestamp
+date       date
+array      array
+list       array
+map        map
+env        map
+struct
--- End diff --

Sounds good. For the mappings 'POSIXct / POSIXlt' to 'timestamp' and 'Date' to 'date', do we need to update the 'getSQLDataType' method?

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91






[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-14 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70922863
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
+and Spark.
+
+ Data type mapping between R and Spark
+
+R          Spark
+byte       byte
+integer    integer
+float      float
+double     double
+numeric    double
+character  string
+string     string
+binary     binary
+raw        binary
+logical    boolean
+timestamp  timestamp
+date       date
+array      array
+list       array
+map        map
+env        map
+struct
--- End diff --

And as you mentioned above, we can also change `date` to `Date` to be more specific. (It would be ideal, now that I think about it, to link these R types to the CRAN help pages. For example, we can link to https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html for `Date` and https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html for `POSIXct / POSIXlt`.)





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-14 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70922747
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
+and Spark.
+
+ Data type mapping between R and Spark
+
+R          Spark
+byte       byte
+integer    integer
+float      float
+double     double
+numeric    double
+character  string
+string     string
+binary     binary
+raw        binary
+logical    boolean
+timestamp  timestamp
+date       date
+array      array
+list       array
+map        map
+env        map
+struct
--- End diff --

We can remove map and struct. For timestamp, let's replace the R side of the table with `POSIXct` / `POSIXlt`.





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-14 Thread NarineK
Github user NarineK commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70921996
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
+and Spark.
+
+ Data type mapping between R and Spark
+
+R          Spark
+byte       byte
+integer    integer
+float      float
+double     double
+numeric    double
+character  string
+string     string
+binary     binary
+raw        binary
+logical    boolean
+timestamp  timestamp
+date       date
+array      array
+list       array
+map        map
+env        map
+struct
--- End diff --

Thanks for the explanation, @shivaram!
So, I'll remove map, struct and timestamp, and leave the rest as is.
Does that sound fine?





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-14 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70920785
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
+and Spark.
+
+ Data type mapping between R and Spark
+
+R          Spark
+byte       byte
+integer    integer
+float      float
+double     double
+numeric    double
+character  string
+string     string
+binary     binary
+raw        binary
+logical    boolean
+timestamp  timestamp
+date       date
+array      array
+list       array
+map        map
+env        map
+struct
--- End diff --

That's a good point - so users can create a schema with `struct`, and that maps to a corresponding SQL type. But they can't create any R objects that will be parsed as a `struct`. The main reason our schema is more flexible than our serialization / deserialization support is that the schema can also be used to, say, read JSON files or JDBC tables etc.

For the use case here, where users are returning a `data.frame` from a UDF, I don't think there is any valid mapping for `struct` from R.





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-14 Thread NarineK
Github user NarineK commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70920518
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
+and Spark.
+
+ Data type mapping between R and Spark
+
+R          Spark
+byte       byte
+integer    integer
+float      float
+double     double
+numeric    double
+character  string
+string     string
+binary     binary
+raw        binary
+logical    boolean
+timestamp  timestamp
+date       date
+array      array
+list       array
+map        map
+env        map
+struct
--- End diff --

@shivaram, I've looked at the following list:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L92
It is called when creating the schema's fields, and it includes map, struct, timestamp, etc.





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-14 Thread NarineK
Github user NarineK commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70920244
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
+and Spark.
+
+ Data type mapping between R and Spark
+
+R          Spark
+byte       byte
+integer    integer
+float      float
+double     double
+numeric    double
+character  string
+string     string
+binary     binary
+raw        binary
+logical    boolean
+timestamp  timestamp
+date       date
+array      array
+list       array
+map        map
+env        map
+struct
--- End diff --

@felixcheung, I think that, according to the following mapping, we expect 'date':

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91
And it seems that there is a 'Date' class in base R. Do I understand correctly?





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-14 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70905195
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
+and Spark.
+
+ Data type mapping between R and Spark
+
+R          Spark
+byte       byte
+integer    integer
+float      float
+double     double
+numeric    double
+character  string
+string     string
+binary     binary
+raw        binary
+logical    boolean
+timestamp  timestamp
+date       date
+array      array
+list       array
+map        map
+env        map
+struct
--- End diff --

I don't think `date` is a type either.





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-14 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70846132
  
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of the returned `data.frame` are 
set by user. Below data type mapping between R
+and Spark.
+
+ Data type mapping between R and Spark
+
+R          Spark
+byte       byte
+integer    integer
+float      float
+double     double
+numeric    double
+character  string
+string     string
+binary     binary
+raw        binary
+logical    boolean
+timestamp  timestamp
+date       date
+array      array
+list       array
+map        map
+env        map
+struct
--- End diff --

I don't think R has any notion of a `struct` or `map` data type? Looking at the list of R data structures at http://adv-r.had.co.nz/Data-structures.html, I think we should remove the struct -> struct and map -> map entries. Also, I don't think there is a `timestamp` class in R. We should probably replace that with `POSIXct` or `POSIXlt`?
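
A quick check of the relevant base R classes:

```r
# R has no "timestamp" class; timestamps are POSIXct/POSIXlt and
# calendar dates are Date.
t <- Sys.time()
class(t)
# [1] "POSIXct" "POSIXt"

d <- Sys.Date()
class(d)
# [1] "Date"
```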





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-13 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70711218
  
--- Diff: docs/sparkr.md ---
@@ -312,7 +310,82 @@ head(ldf, 3)
 Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
 that key. The groups are chosen from `SparkDataFrame`s column(s).
 The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
-`SparkDataFrame`. It must match the R function's output.
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of each output field in the schema 
are set by user. Bellow data type mapping between R
--- End diff --

`Bellow` should be `Below`?





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-13 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70711263
  
--- Diff: docs/sparkr.md ---
@@ -312,7 +310,82 @@ head(ldf, 3)
 Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
 that key. The groups are chosen from `SparkDataFrame`s column(s).
 The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
-`SparkDataFrame`. It must match the R function's output.
+`SparkDataFrame`. It must represent R function's output schema on the 
basis of Spark data types. The column names of each output field in the schema 
are set by user. Bellow data type mapping between R
--- End diff --

same, `output field` here





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-13 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r7076
  
--- Diff: docs/sparkr.md ---
@@ -263,7 +263,7 @@ In SparkR, we support several kinds of User-Defined 
Functions:
 # dapply
 Apply a function to each partition of a `SparkDataFrame`. The function to 
be applied to each partition of the `SparkDataFrame`
 and should have only one parameter, to which a `data.frame` corresponds to 
each partition will be passed. The output of function
-should be a `data.frame`. Schema specifies the row format of the resulting 
a `SparkDataFrame`. It must match the R function's output.
+should be a `data.frame`. Schema specifies the row format of the resulting 
a `SparkDataFrame`. It must match to [data types of R function's output 
fields](#data-type-mapping-between-r-and-spark).
--- End diff --

`output fields` --> `return values` or `return value`?
http://adv-r.had.co.nz/Functions.html#return-values
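
For reference, a minimal sketch of the `dapply` call the quoted passage documents (assuming `df <- createDataFrame(faithful)`; the schema fields must match the data types of the function's return value):

```r
# The function returns the input data.frame plus one numeric column, so
# the schema declares the two original double fields plus a third double.
schema <- structType(structField("eruptions", "double"),
                     structField("waiting", "double"),
                     structField("waiting_secs", "double"))
df1 <- dapply(df, function(x) { cbind(x, x$waiting * 60) }, schema)
head(collect(df1))
```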





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-11 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70346974
  
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

I think those mappings are only used to print things in `str`. A better list to consult would be the one at https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R#L23 -- as that says, `list` in R should become an `array` in SparkSQL, and `env` in R should map to a `map`.
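
A small sketch of the R values in question:

```r
# An R list is serialized as a SparkSQL array; an R environment as a map.
l <- list(1, 2, 3)
class(l)
# [1] "list"

e <- new.env()
assign("a", 1L, envir = e)
class(e)
# [1] "environment"
```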





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-10 Thread NarineK
Github user NarineK commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70202736
  
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

Thanks, I was looking at the types.R file and noticed that we have NAs for array, map and struct:
https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L42
But I guess in our case we can have array, map and struct mapped to array, map and struct, correspondingly?





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-10 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70202560
  
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

This looks good to me!





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-10 Thread NarineK
Github user NarineK commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70202321
  
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

Thanks @shivaram.
Does the following mapping look fine to have in the table?
```
**R          Spark**
byte         byte
integer      integer
float        float
double       double
numeric      double
character    string
string       string
binary       binary
raw          binary
logical      boolean
timestamp    timestamp
date         date
array        array
map          map
struct       struct
```
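
For illustration, a sketch of how the Spark-side names in such a table would appear in a schema declaration (hypothetical field names, assuming the type strings accepted by `structField`):

```r
# The right-hand column of the table is what goes into structField().
schema <- structType(structField("id", "integer"),
                     structField("name", "string"),
                     structField("score", "double"),
                     structField("active", "boolean"))
```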





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-10 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70202064
  
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

Yeah, but instead of a pointer to the code, it would be great if we could have a table in the documentation.





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-10 Thread NarineK
Github user NarineK commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70198331
  
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

Or we could probably also refer to this?
https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L21





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-10 Thread NarineK
Github user NarineK commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70194370
  
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

I see. I think we can describe the following type mapping in the 
programming guide. 

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91
Those are the types used in the StructType's fields.





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70172206
  
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

I think gapply and dapply are the first important use cases where we require a strict mapping of Spark JVM types to R atomic types. It might be worthwhile to add a section in the programming guide to illustrate and explain that further.

To be more concrete, what should the column type of the UDF's output R data.frame be if the SparkDataFrame has a column of double? It would be good to have a table on that.

That could be a separate PR though.
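
For illustration, going by the mapping table proposed elsewhere in this thread, a Spark `double` column arrives in the UDF as an R `numeric` vector, and a `numeric` column in the returned `data.frame` is declared as `double` in the schema. A hedged sketch with hypothetical column names:

```r
# Hypothetical sketch: a Spark double column round-trips as R numeric.
schema <- structType(structField("waiting", "double"),
                     structField("avg_eruption", "double"))
result <- gapply(df, "waiting",
                 function(key, x) {
                   # key and x$eruptions are numeric vectors here
                   data.frame(key, mean(x$eruptions))
                 },
                 schema)
```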







[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-09 Thread NarineK
Github user NarineK commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r70168781
  
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

Thanks @felixcheung, does this sound better?
"It must reflect R function's output schema on the basis of Spark data types. The column names of each output field in the schema are set by user." I could also bring up some examples.





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-07 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r7362
  
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

I suppose this could be explained in `dapply` above as well





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-07 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14090#discussion_r69955401
  
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
 {% endhighlight %}
 
 
+ Run a given function on a large dataset grouping by input column(s) 
and using `gapply` or `gapplyCollect`
+
+# gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to 
be applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

It was hard to do in the roxygen2 doc, but the programming guide would be a great place to touch on, or refer to, what "match" means exactly - the type mapping between Spark and R is a bit fuzzy, and it would be good to explain that a bit more.





[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...

2016-07-07 Thread NarineK
GitHub user NarineK opened a pull request:

https://github.com/apache/spark/pull/14090

[SPARK-16112][SparkR] Programming guide for gapply/gapplyCollect

## What changes were proposed in this pull request?

Updates programming guide for spark.gapply/spark.gapplyCollect.

Similar to other examples, I used the `faithful` dataset to demonstrate gapply's functionality.
Please let me know if you prefer another example.
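
For reference, a sketch along the lines of the example this PR adds (assuming `df <- createDataFrame(faithful)`):

```r
# Determine the waiting times with the largest eruption time in minutes.
schema <- structType(structField("waiting", "double"),
                     structField("max_eruption", "double"))
result <- gapply(
    df,
    "waiting",
    function(key, x) {
        data.frame(key, max(x$eruptions))
    },
    schema)
head(collect(arrange(result, "max_eruption", decreasing = TRUE)))
```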

## How was this patch tested?
Existing test cases in R

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/NarineK/spark gapplyProgGuide

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14090.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14090


commit 29d8a5c6c22202cdf7d6cc44f1d6cbeca5946918
Author: Narine Kokhlikyan 
Date:   2016-06-20T22:12:11Z

Fixed duplicated documentation problem + separated documentation for dapply 
and dapplyCollect

commit 698c4331d2a8bfe7f4b372ebc8123b6c27a57e68
Author: Narine Kokhlikyan 
Date:   2016-06-23T18:51:48Z

merge with master

commit 85a4493a03b3601a93c25ebc1eafb2868efec8d8
Author: Narine Kokhlikyan 
Date:   2016-07-07T13:18:49Z

Adding programming guide for gapply/gapplyCollect

commit 7781d1c111f38e3608d5ebd468e6d344d52efa5c
Author: Narine Kokhlikyan 
Date:   2016-07-07T13:27:35Z

removing output format



