[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14090
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r71047599

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,135 @@ head(ldf, 3)
 {% endhighlight %}

+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+
+  R          Spark
+  byte       byte
+  integer    integer
+  float      float
+  double     double
+  numeric    double
+  character  string
+  string     string
+  binary     binary
+  raw        binary
+  logical    boolean
+  [POSIXct](https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html)
--- End diff --

I think we need to put the `<a>` link inside the `<td>`, eg. https://github.com/apache/spark/blame/master/docs/structured-streaming-programming-guide.md#L811
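The `gapply` contract quoted above is easier to see in code. A minimal sketch, assuming a SparkR 2.0 session and the `faithful` dataset that the guide's other examples build on; the field names in the schema are illustrative, not the PR's exact example:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(faithful)  # columns: eruptions, waiting

# The schema describes the UDF's output data.frame in Spark types: one double
# column for the grouping key and one for the per-group aggregate.
schema <- structType(structField("waiting", "double"),
                     structField("max_eruption", "double"))

# The UDF gets the grouping key and the group's rows as an R data.frame, and
# must return a data.frame whose columns match the schema above.
result <- gapply(
  df,
  "waiting",
  function(key, x) {
    data.frame(key, max(x$eruptions))
  },
  schema)

head(collect(arrange(result, "max_eruption", decreasing = TRUE)))
```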
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r71041878

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,135 @@ head(ldf, 3)
[...]
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
--- End diff --

`Below data type` -> `Below is the data type`
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r71041809

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,135 @@ head(ldf, 3)
[...]
+  logical    boolean
+  [POSIXct](https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html)
--- End diff --

Also not sure why, but the URL formatting doesn't seem to be working here. Screenshot of what I see is below
![screenshot 2016-07-15 14 13 56](https://cloud.githubusercontent.com/assets/143893/16888670/61fede2a-4a96-11e6-8b7f-507f3eb194d4.png)
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r71041580

--- Diff: docs/sparkr.md ---
@@ -295,8 +294,7 @@ head(collect(df1))

 ##### dapplyCollect
 Like `dapply`, apply a function to each partition of a `SparkDataFrame` and collect the result back. The output of function
-should be a `data.frame`. But, Schema is not required to be passed. Note that `dapplyCollect` only can be used if the
-output of UDF run on all the partitions can fit in driver memory.
+should be a `data.frame`. But, Schema is not required to be passed. Note that `dapplyCollect` can fail if the output of UDF run on all the partition cannot be pulled to the driver and fit in driver memory.
--- End diff --

I think we need a new line before the `<div>`? Right now the `div` markings show up in the generated doc. I've attached a screenshot
![screenshot 2016-07-15 14 11 39](https://cloud.githubusercontent.com/assets/143893/16888609/1d4409fe-4a96-11e6-97db-6ebf05a03774.png)
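For reference, a sketch of the `dapplyCollect` behavior being reworded here, assuming a SparkR session with `df <- createDataFrame(faithful)`. No schema is passed, and the UDF output from all partitions is pulled straight back to the driver, which is exactly why it fails when that output cannot fit in driver memory:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(faithful)

# Add a derived column on each partition; the result comes back as a local
# R data.frame, so the combined output must fit on the driver.
ldf <- dapplyCollect(
  df,
  function(x) {
    cbind(x, waiting_secs = x$waiting * 60)
  })
head(ldf, 3)
```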
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70926563

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
 {% endhighlight %}

+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+
+  R          Spark
+  byte       byte
+  integer    integer
+  float      float
+  double     double
+  numeric    double
+  character  string
+  string     string
+  binary     binary
+  raw        binary
+  logical    boolean
+  timestamp  timestamp
+  date       date
+  array      array
+  list       array
+  map        map
+  env        map
+  struct
--- End diff --

And `environment` instead of `env`? https://stat.ethz.ch/R-manual/R-devel/library/base/html/environment.html

```
> e <- new.env()
> class(e)
[1] "environment"
```
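A quick illustration of why `environment` is the natural R-side label: an R environment is, in effect, a string-keyed mutable map, which is what makes it the counterpart of Spark's map type. A plain-R sketch, runnable in any session, nothing Spark-specific assumed:

```r
e <- new.env()
e[["one"]] <- 1
e[["two"]] <- 2

class(e)                # "environment"
# Read the environment back as a named list, i.e. map-like key/value pairs.
mget(ls(e), envir = e)  # $one [1] 1 ; $two [1] 2
```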
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70926341

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
[...]
+  timestamp  timestamp
+  date       date
+  array      array
+  list       array
+  map        map
+  env        map
+  struct
--- End diff --

yes it should be `Date` not `date`
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70923795

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
[...]
+  timestamp  timestamp
+  date       date
+  array      array
+  list       array
+  map        map
+  env        map
+  struct
--- End diff --

Not really - as I mentioned, `getSQLDataType` looks at the schema; the method which looks at the R objects is in https://github.com/apache/spark/blob/2e4075e2ece9574100c79558cab054485e25c2ee/R/pkg/R/serialize.R#L84
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user NarineK commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70923645

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
[...]
+  timestamp  timestamp
+  date       date
+  array      array
+  list       array
+  map        map
+  env        map
+  struct
--- End diff --

Sounds good. For the mappings 'POSIXct / POSIXlt' to 'timestamp' and 'Date' to 'date', do we need to update the 'getSQLDataType' method? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70922863

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
[...]
+  timestamp  timestamp
+  date       date
+  array      array
+  list       array
+  map        map
+  env        map
+  struct
--- End diff --

And as you mentioned above we can also change `date` to `Date` to be more specific. (It would be ideal, now that I think of it, to link these R types to the CRAN help pages. For example we can link to https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html for `Date` and https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html for `POSIXct` / `POSIXlt`.)
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70922747

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
[...]
+  timestamp  timestamp
+  date       date
+  array      array
+  list       array
+  map        map
+  env        map
+  struct
--- End diff --

We can remove map and struct. For timestamp, let's replace the R side of the table with `POSIXct` / `POSIXlt`.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user NarineK commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70921996

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
[...]
+  timestamp  timestamp
+  date       date
+  array      array
+  list       array
+  map        map
+  env        map
+  struct
--- End diff --

Thanks for the explanation, @shivaram! So I'll remove map, struct and timestamp and leave the rest as is. Does that sound fine?
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70920785

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
[...]
+  timestamp  timestamp
+  date       date
+  array      array
+  list       array
+  map        map
+  env        map
+  struct
--- End diff --

That's a good point - users can create a schema with `struct`, and it maps to a corresponding SQL type, but they can't create any R objects that will be parsed as `struct`. The main reason our schema is more flexible than our serialization / deserialization support is that the schema can also be used to, say, read JSON files or JDBC tables. For the use case here, where users are returning a `data.frame` from the UDF, I don't think there is any valid mapping for `struct` from R.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user NarineK commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70920518

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
[...]
+  timestamp  timestamp
+  date       date
+  array      array
+  list       array
+  map        map
+  env        map
+  struct
--- End diff --

@shivaram, I've looked at the following list: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L92 It is called when creating the schema's fields, and it has map, struct, timestamp, etc.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user NarineK commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70920244

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
[...]
+  timestamp  timestamp
+  date       date
+  array      array
+  list       array
+  map        map
+  env        map
+  struct
--- End diff --

@felixcheung, I think according to the following mapping we expect 'date': https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91 And it seems that there is a 'Date' class in base R. Do I understand correctly?
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70905195

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
[...]
+  timestamp  timestamp
+  date       date
+  array      array
+  list       array
+  map        map
+  env        map
+  struct
--- End diff --

I don't think `date` is a type either.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70846132

--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
[...]
+  timestamp  timestamp
+  date       date
+  array      array
+  list       array
+  map        map
+  env        map
+  struct
--- End diff --

I don't think R has any notion of a `struct` or `map` data type. Looking at the list of R data structures at http://adv-r.had.co.nz/Data-structures.html I think we should remove the struct -> struct and map -> map entries. Also, I don't think there is a `timestamp` class in R. We should probably replace that with `POSIXct` or `POSIXlt`?
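To make the proposed substitutions concrete, these are the class names base R itself reports for the values in question; the Spark types on the right reflect the mapping being discussed in this thread, not an authoritative list:

```r
class(1L)                      # "integer"            -> Spark integer
class(1.5)                     # "numeric"            -> Spark double
class("a")                     # "character"          -> Spark string
class(TRUE)                    # "logical"            -> Spark boolean
class(charToRaw("a"))          # "raw"                -> Spark binary
class(Sys.Date())              # "Date"               -> Spark date
class(Sys.time())              # "POSIXct" "POSIXt"   -> Spark timestamp
class(as.POSIXlt(Sys.time()))  # "POSIXlt" "POSIXt"   -> Spark timestamp
```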
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70711218

--- Diff: docs/sparkr.md ---
@@ -312,7 +310,82 @@ head(ldf, 3)
 Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
 that key. The groups are chosen from `SparkDataFrame`s column(s).
 The output of function should be a `data.frame`. Schema specifies the row format of the resulting
-`SparkDataFrame`. It must match the R function's output.
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of each output field in the schema are set by user. Bellow data type mapping between R
--- End diff --

`Bellow` should be `Below`?
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70711263

--- Diff: docs/sparkr.md ---
@@ -312,7 +310,82 @@ head(ldf, 3)
[...]
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of each output field in the schema are set by user. Bellow data type mapping between R
--- End diff --

same, `output field` here
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r7076

--- Diff: docs/sparkr.md ---
@@ -263,7 +263,7 @@ In SparkR, we support several kinds of User-Defined Functions:

 ##### dapply
 Apply a function to each partition of a `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame` and should have only one parameter, to which a `data.frame` corresponds to each partition will be passed. The output of function
-should be a `data.frame`. Schema specifies the row format of the resulting a `SparkDataFrame`. It must match the R function's output.
+should be a `data.frame`. Schema specifies the row format of the resulting a `SparkDataFrame`. It must match to [data types of R function's output fields](#data-type-mapping-between-r-and-spark).
--- End diff --

`output fields` --> `return values` or `return value`? http://adv-r.had.co.nz/Functions.html#return-values
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70346974

--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
 {% endhighlight %}

+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

I think those mappings are only used to print things in `str`. A better list to consult would be the list at https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R#L23 -- as that says, `list` in R should become an `array` in Spark SQL and `env` in R should map to a `map`.
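A hedged sketch of what that SerDe mapping means for a schema author. This assumes SparkR's `structField` accepts composite type strings such as `"array<double>"` and `"map<string,string>"`, which the schema-parsing code referenced in this thread suggests; the field names are illustrative:

```r
library(SparkR)

# A UDF output column built as an R list of doubles would be declared as a
# Spark array; one built as an R environment would be declared as a map
# (assumed type-string syntax).
schema <- structType(
  structField("name",   "string"),
  structField("scores", "array<double>"),
  structField("attrs",  "map<string,string>"))
```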
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user NarineK commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70202736

--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
[...]
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

Thanks, I was looking at the types.R file and noticed that we have NAs for array, map and struct. https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L42 But I guess in our case we can have array, map and struct mapped to array, map and struct correspondingly?
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70202560

--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
[...]
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

This looks good to me!
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user NarineK commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70202321

--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
[...]
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

Thanks @shivaram. Does the following mapping look fine to have in the table?

```
R          Spark
byte       byte
integer    integer
float      float
double     double
numeric    double
character  string
string     string
binary     binary
raw        binary
logical    boolean
timestamp  timestamp
date       date
array      array
map        map
struct     struct
```
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70202064

--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
[...]
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

Yeah, but instead of a pointer to the code it would be great if we could have a table in the documentation.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user NarineK commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70198331

--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
[...]
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

Or we could probably also refer to this? https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L21
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user NarineK commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70194370

--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
[...]
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

I see. I think we can describe the following type mapping in the programming guide: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91 Those are the types used in the StructType's fields.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70172206

--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
[...]
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

I think gapply and dapply are the first important use cases where we require a strict mapping from Spark JVM types to R atomic types. It might be worthwhile to add a section in the programming guide to illustrate and explain that further. To be more concrete: what should the column type of the UDF output R data.frame be if the SparkDataFrame has a column of double? It would be good to have a table on that. That could be a separate PR though.
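One way to make the double-column case concrete (a sketch, not the guide's wording): a `double` column of a `SparkDataFrame` arrives in the UDF's `data.frame` as R `numeric`, and declaring the output field as `"double"` round-trips it. Assumes a SparkR session and `df <- createDataFrame(faithful)`, where `waiting` is a double column:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(faithful)

# Spark double -> R numeric inside the UDF; R numeric -> Spark double on the
# way back, per the schema declared below.
schema <- structType(structField("waiting_hours", "double"))
hours <- dapply(df, function(x) {
  data.frame(x$waiting / 60)
}, schema)
head(collect(hours))
```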
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user NarineK commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r70168781

--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
[...]
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

Thanks @felixcheung, does this sound better? "It must reflect R function's output schema on the basis of Spark data types. The column names of each output field in the schema are set by user." I could also bring up some examples.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r7362

--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
[...]
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

I suppose this could be explained in `dapply` above as well.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14090#discussion_r69955401

--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
[...]
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --

It was hard to do in the roxygen2 doc, but the programming guide would be a great place to touch on, or refer to, what "match" means exactly. The type mapping between Spark and R is a bit fuzzy, and it would be good to explain that a bit more.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
GitHub user NarineK opened a pull request: https://github.com/apache/spark/pull/14090

[SPARK-16112][SparkR] Programming guide for gapply/gapplyCollect

## What changes were proposed in this pull request?

Updates the programming guide for spark.gapply/spark.gapplyCollect.

Similar to other examples, I used the faithful dataset to demonstrate gapply's functionality. Please let me know if you prefer another example.

## How was this patch tested?

Existing test cases in R.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/NarineK/spark gapplyProgGuide

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14090.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14090

commit 29d8a5c6c22202cdf7d6cc44f1d6cbeca5946918
Author: Narine Kokhlikyan
Date: 2016-06-20T22:12:11Z

    Fixed duplicated documentation problem + separated documentation for dapply and dapplyCollect

commit 698c4331d2a8bfe7f4b372ebc8123b6c27a57e68
Author: Narine Kokhlikyan
Date: 2016-06-23T18:51:48Z

    merge with master

commit 85a4493a03b3601a93c25ebc1eafb2868efec8d8
Author: Narine Kokhlikyan
Date: 2016-07-07T13:18:49Z

    Adding programming guide for gapply/gapplyCollect

commit 7781d1c111f38e3608d5ebd468e6d344d52efa5c
Author: Narine Kokhlikyan
Date: 2016-07-07T13:27:35Z

    removing output format
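For context on what the guide ends up documenting, a sketch of the `gapplyCollect` variant on the same `faithful` dataset (column names illustrative). Like `dapplyCollect`, it takes no schema, so the UDF names its own output columns, and the collected result must fit in driver memory:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(faithful)

# Per-group maximum eruption time, returned directly as a local R data.frame.
result <- gapplyCollect(
  df,
  "waiting",
  function(key, x) {
    y <- data.frame(key, max(x$eruptions))
    colnames(y) <- c("waiting", "max_eruption")
    y
  })

head(result[order(result$max_eruption, decreasing = TRUE), ])
```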