[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/11569 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214907846 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214907849 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57043/ Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214907828 LGTM. Merging this.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214907761
**[Test build #57043 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57043/consoleFull)** for PR 11569 at commit [`838c915`](https://github.com/apache/spark/commit/838c9155839bbb7fd4d5f855a9d88ae68fef2ffb).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214904802 **[Test build #57043 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57043/consoleFull)** for PR 11569 at commit [`838c915`](https://github.com/apache/spark/commit/838c9155839bbb7fd4d5f855a9d88ae68fef2ffb).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214903911 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214903894
**[Test build #57038 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57038/consoleFull)** for PR 11569 at commit [`fc2f6a3`](https://github.com/apache/spark/commit/fc2f6a31166ac895b5c2ce05074f5c7edf372706).
* This patch **fails SparkR unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214903913 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57038/ Test FAILed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214900440 **[Test build #57038 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57038/consoleFull)** for PR 11569 at commit [`fc2f6a3`](https://github.com/apache/spark/commit/fc2f6a31166ac895b5c2ce05074f5c7edf372706).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214898708 LGTM. Thanks @olarayej - I just had a couple of minor comments about using `SparkDataFrame`. Other than that this looks good to merge
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11569#discussion_r61170989

    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -2469,6 +2469,126 @@ setMethod("drop",
                 base::drop(x)
               })
    +#' This function computes a histogram for a given SparkR Column.
    +#'
    +#' @name histogram
    +#' @title Histogram
    +#' @param nbins the number of bins (optional). Default value is 10.
    +#' @param df the DataFrame containing the Column to build the histogram from.
    +#' @param colname the name of the column to build the histogram from.
    +#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
    +#' @rdname histogram
    +#' @family SparkDataFrame functions
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#'
    +#' # Create a DataFrame from the Iris dataset
    +#' irisDF <- createDataFrame(sqlContext, iris)
    +#'
    +#' # Compute histogram statistics
    +#' histStats <- histogram(irisDF, irisDF$Sepal_Length, nbins = 12)
    +#'
    +#' # Once SparkR has computed the histogram statistics, the histogram can be
    +#' # rendered using the ggplot2 library:
    +#'
    +#' require(ggplot2)
    +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +
    +#'         geom_bar(stat = "identity") +
    +#'         xlab("Sepal_Length") + ylab("Frequency")
    +#' }
    +setMethod("histogram",
    +          signature(df = "SparkDataFrame", col = "characterOrColumn"),
    +          function(df, col, nbins = 10) {
    +            # Validate nbins
    +            if (nbins < 2) {
    +              stop("The number of bins must be a positive integer number greater than 1.")
    +            }
    +
    +            # Round nbins to the smallest integer
    +            nbins <- floor(nbins)
    +
    +            # Validate col
    +            if (is.null(col)) {
    +              stop("col must be specified.")
    +            }
    +
    +            colname <- col
    +            x <- if (class(col) == "character") {
    +              if (!colname %in% names(df)) {
    +                stop("Specified colname does not belong to the given DataFrame.")
    --- End diff --

    `DataFrame` -> `SparkDataFrame` ?
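For readers following the diff, the argument checks quoted above can be tried in plain R without a Spark session. The following is a minimal sketch, not the code in the PR: `validate_nbins` and `local_histogram` are hypothetical helpers, and the equal-width binning is only an assumption about how the "counts and centroids" described in the roxygen comment might be produced. It uses the base R `iris` dataset, whose column is named `Sepal.Length` rather than `Sepal_Length`.

```r
# Hypothetical helper mirroring the nbins check quoted in the diff.
validate_nbins <- function(nbins) {
  if (nbins < 2) {
    stop("The number of bins must be a positive integer number greater than 1.")
  }
  # floor() rounds down, so 12.7 becomes 12
  floor(nbins)
}

# Local, base-R stand-in for the histogram statistics.
# Assumption: equal-width bins between min(x) and max(x).
local_histogram <- function(x, nbins = 10) {
  nbins <- validate_nbins(nbins)
  x <- x[!is.na(x)]              # drop NA values, as the diff does with na.omit
  lo <- min(x)
  hi <- max(x)
  width <- (hi - lo) / nbins
  if (width == 0) {
    stop("All values are identical; equal-width bins are undefined.")
  }
  # Zero-based bin index; the maximum value is placed in the last bin
  idx <- pmin(floor((x - lo) / width), nbins - 1)
  counts <- tabulate(idx + 1, nbins = nbins)
  centroids <- lo + (seq_len(nbins) - 0.5) * width
  data.frame(centroids = centroids, counts = counts)
}

histStats <- local_histogram(iris$Sepal.Length, nbins = 12)
```

The resulting `data.frame` has the same two columns (`centroids`, `counts`) that the roxygen example passes to `ggplot`, so the plotting snippet in the diff could be tried against it locally.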
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11569#discussion_r61170976

    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -2469,6 +2469,126 @@ setMethod("drop",
                 base::drop(x)
               })
    +#' This function computes a histogram for a given SparkR Column.
    +#'
    +#' @name histogram
    +#' @title Histogram
    +#' @param nbins the number of bins (optional). Default value is 10.
    +#' @param df the DataFrame containing the Column to build the histogram from.
    +#' @param colname the name of the column to build the histogram from.
    +#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
    +#' @rdname histogram
    +#' @family SparkDataFrame functions
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#'
    +#' # Create a DataFrame from the Iris dataset
    --- End diff --

    As @felixcheung mentioned before `DataFrame`-> `SparkDataFrame` ? Or we can just delete this comment
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214898243 @shivaram @felixcheung I have addressed all your comments. Anything else? Or shall we merge? Thanks!
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214894940 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214894944 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57029/ Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214894669
**[Test build #57029 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57029/consoleFull)** for PR 11569 at commit [`e9dbc5b`](https://github.com/apache/spark/commit/e9dbc5b27c258777a539723e0ad4676db928736b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11569#discussion_r61168496

    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -2465,6 +2465,110 @@ setMethod("drop",
                 base::drop(x)
               })
    +#' This function computes a histogram for a given SparkR Column.
    +#'
    +#' @name histogram
    +#' @title Histogram
    +#' @param nbins the number of bins (optional). Default value is 10.
    +#' @param df the DataFrame containing the Column to build the histogram from.
    +#' @param colname the name of the column to build the histogram from.
    +#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
    +#' @rdname histogram
    +#' @family DataFrame functions
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' # Create a DataFrame from the Iris dataset
    +#' irisDF <- createDataFrame(sqlContext, iris)
    +#'
    +#' # Compute histogram statistics
    +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
    +#'
    +#' # Once SparkR has computed the histogram statistics, the histogram can be
    +#' # rendered using the ggplot2 library:
    +#'
    +#' require(ggplot2)
    +#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
    +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100)
    +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
    +#' }
    +setMethod("histogram",
    +          signature(df = "DataFrame", col = "characterOrColumn"),
    +          function(df, col, nbins = 10) {
    +            # Validate nbins
    +            if (nbins < 2) {
    +              stop("The number of bins must be a positive integer number greater than 1.")
    +            }
    +
    +            # Round nbins to the smallest integer
    +            nbins <- floor(nbins)
    +
    +            # Validate col
    +            if (is.null(col)) {
    +              stop("col must be specified.")
    +            }
    +
    +            colname <- col
    +            x <- if (class(col) == "character") {
    +              if (!colname %in% names(df)) {
    +                stop("Specified colname does not belong to the given DataFrame.")
    +              }
    +
    +              # Filter NA values in the target column and remove all other columns
    +              df <- na.omit(df[, colname])
    +              getColumn(df, colname)
    +
    +            } else if (class(col) == "Column") {
    +
    +              # Append the given column to the dataset. This is to support Columns that
    +              # don't belong to the DataFrame but are rather expressions
    +              df$x <- col
    --- End diff --

    @shivaram Yes, I have fixed this. Thanks!
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214890845 **[Test build #57029 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57029/consoleFull)** for PR 11569 at commit [`e9dbc5b`](https://github.com/apache/spark/commit/e9dbc5b27c258777a539723e0ad4676db928736b).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214890244 @shivaram @felixcheung Looks like the version of lint-r running on the build server is different from the one on Spark's GitHub. Even though lint-r passes locally, I keep getting this error:
R/DataFrame.R:2542:40: style: Put spaces around all infix operators.
collapse=""
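The lint message above flags `collapse=""`, which suggests the style check treats the `=` of a named argument as an infix operator that needs surrounding spaces. As a minimal illustration only (the actual line 2542 is not quoted in this thread, so the `paste` call and the `parts` vector are hypothetical), the two spellings behave identically and differ only in style:

```r
parts <- c("a", "b", "c")

# Style that the lint message complains about (no spaces around `=`):
#   joined <- paste(parts, collapse="")

# Same call with spaces around `=`, which should satisfy the check:
joined <- paste(parts, collapse = "")
```

If the local lint-r run does not report this, the local and CI lintr versions or rule sets presumably differ, which matches the observation in the comment.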
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214888704
**[Test build #57028 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57028/consoleFull)** for PR 11569 at commit [`cd7ba4c`](https://github.com/apache/spark/commit/cd7ba4c3af26beba4ac4c0f09ea6f3560069d5a4).
* This patch **fails R style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214888718 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57028/ Test FAILed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214888714 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214887742 **[Test build #57028 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57028/consoleFull)** for PR 11569 at commit [`cd7ba4c`](https://github.com/apache/spark/commit/cd7ba4c3af26beba4ac4c0f09ea6f3560069d5a4).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11569#discussion_r61009021

    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -2469,6 +2469,110 @@ setMethod("drop",
                 base::drop(x)
               })
    +#' This function computes a histogram for a given SparkR Column.
    +#'
    +#' @name histogram
    +#' @title Histogram
    +#' @param nbins the number of bins (optional). Default value is 10.
    +#' @param df the DataFrame containing the Column to build the histogram from.
    +#' @param colname the name of the column to build the histogram from.
    +#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
    +#' @rdname histogram
    +#' @family DataFrame functions
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' # Create a DataFrame from the Iris dataset
    +#' irisDF <- createDataFrame(sqlContext, iris)
    +#'
    +#' # Compute histogram statistics
    +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
    +#'
    +#' # Once SparkR has computed the histogram statistics, the histogram can be
    +#' # rendered using the ggplot2 library:
    +#'
    +#' require(ggplot2)
    +#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
    +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100)
    +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
    +#' }
    +setMethod("histogram",
    +          signature(df = "DataFrame", col = "characterOrColumn"),
    --- End diff --

    Yeah, let me deliver that with Shivaram's fix as well. Thanks!
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11569#discussion_r61008558

    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -2469,6 +2469,110 @@ setMethod("drop",
                 base::drop(x)
               })
    +#' This function computes a histogram for a given SparkR Column.
    +#'
    +#' @name histogram
    +#' @title Histogram
    +#' @param nbins the number of bins (optional). Default value is 10.
    +#' @param df the DataFrame containing the Column to build the histogram from.
    +#' @param colname the name of the column to build the histogram from.
    +#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
    +#' @rdname histogram
    +#' @family DataFrame functions
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' # Create a DataFrame from the Iris dataset
    --- End diff --

    DataFrame -> SparkDataFrame, or just omit this comment line...
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11569#discussion_r61008598

    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -2469,6 +2469,110 @@ setMethod("drop",
                 base::drop(x)
               })
    +#' This function computes a histogram for a given SparkR Column.
    +#'
    +#' @name histogram
    +#' @title Histogram
    +#' @param nbins the number of bins (optional). Default value is 10.
    +#' @param df the DataFrame containing the Column to build the histogram from.
    +#' @param colname the name of the column to build the histogram from.
    +#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
    +#' @rdname histogram
    +#' @family DataFrame functions
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' # Create a DataFrame from the Iris dataset
    +#' irisDF <- createDataFrame(sqlContext, iris)
    +#'
    +#' # Compute histogram statistics
    +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
    +#'
    +#' # Once SparkR has computed the histogram statistics, the histogram can be
    +#' # rendered using the ggplot2 library:
    +#'
    +#' require(ggplot2)
    +#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
    +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100)
    +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
    +#' }
    +setMethod("histogram",
    +          signature(df = "DataFrame", col = "characterOrColumn"),
    --- End diff --

    `"DataFrame"` -> `"SparkDataFrame"`
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11569#discussion_r61008503

    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -2469,6 +2469,110 @@ setMethod("drop",
                 base::drop(x)
               })
    +#' This function computes a histogram for a given SparkR Column.
    +#'
    +#' @name histogram
    +#' @title Histogram
    +#' @param nbins the number of bins (optional). Default value is 10.
    +#' @param df the DataFrame containing the Column to build the histogram from.
    +#' @param colname the name of the column to build the histogram from.
    +#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
    +#' @rdname histogram
    +#' @family DataFrame functions
    --- End diff --

    this has changed as well `@family SparkDataFrame functions` sorry this is such a moving target
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214539753 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56927/ Test FAILed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214539717
**[Test build #56927 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56927/consoleFull)** for PR 11569 at commit [`976e412`](https://github.com/apache/spark/commit/976e412e7cdcbee95164f05eaf088e5ec7b08160).
* This patch **fails MiMa tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214539752 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214537526 **[Test build #56927 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56927/consoleFull)** for PR 11569 at commit [`976e412`](https://github.com/apache/spark/commit/976e412e7cdcbee95164f05eaf088e5ec7b08160).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214528787 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214528791 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56924/ Test FAILed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214528756 **[Test build #56924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56924/consoleFull)** for PR 11569 at commit [`7cdb9e8`](https://github.com/apache/spark/commit/7cdb9e83ee4a7223347a7d10eac9ab2a3ce51ca7).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214528780 **[Test build #56924 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56924/consoleFull)** for PR 11569 at commit [`7cdb9e8`](https://github.com/apache/spark/commit/7cdb9e83ee4a7223347a7d10eac9ab2a3ce51ca7). * This patch **fails some tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213854068 Please rebase to pick up the `DataFrame` -> `SparkDataFrame` class name change.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60797789 --- Diff: R/pkg/R/DataFrame.R --- @@ -2465,6 +2465,110 @@ setMethod("drop", base::drop(x) }) +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family DataFrame functions +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column and remove all other columns + df <- na.omit(df[, colname]) + getColumn(df, colname) + +} else if (class(col) == "Column") { + + # 
Append the given column to the dataset. This is to support Columns that + # don't belong to the DataFrame but are rather expressions + df$x <- col --- End diff --
Do we need to check whether `x` is a column name already present in the data frame? For example, I ran the code
```
irisDF$x <- irisDF$Petal_Width + 2.0
histogram(irisDF, irisDF$x, 8)
```
and I got an error
```
org.apache.spark.sql.AnalysisException: resolved attribute(s) x#141 missing from Species#4,Sepal_Length#0,x#269,Petal_Width#3,Petal_Length#2,Sepal_Width#1 in operator !Project [Sepal_Length#0,Sepal_Width#1,Petal_Length#2,Petal_Width#3,Species#4,x#269,castcast(castx#141 - 2.1) / 2.4) * 1.0) as int) as double) / 1.0) / 0.125) - CASE WHEN cast(castx#141 - 2.1) / 2.4) * 1.0) as int) as double) / 1.0) / 0.125) = cast(cast(((cast(castx#141 - 2.1) / 2.4) * 1.0) as int) as double) / 1.0) / 0.125) as int) as double)) && NOT (x#141 = 2.1)) THEN 1.0 ELSE 0.0 END) as int) AS bins#325]
```
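One possible way to avoid the clash described in this comment is to pick a temporary column name that is guaranteed not to exist in the data frame, instead of the hard-coded `x`. The following is only a sketch in plain R: `uniqueColName` is a hypothetical helper, not a SparkR function, and the thread does not say whether the PR adopted this approach.

```r
# Hypothetical helper (not part of SparkR): return a column name derived
# from `base` that does not clash with any name in `existing`.
uniqueColName <- function(existing, base = "x") {
  candidate <- base
  suffix <- 0
  # Keep appending an increasing numeric suffix until the name is unused
  while (candidate %in% existing) {
    suffix <- suffix + 1
    candidate <- paste0(base, "_", suffix)
  }
  candidate
}

# Column names of the iris-based data frame from the comment above,
# after the user has already added their own `x` column.
existing <- c("Sepal_Length", "Sepal_Width", "Petal_Length",
              "Petal_Width", "Species", "x")
tmpName <- uniqueColName(existing)
print(tmpName)
```

With such a name the method could append the Column expression without overwriting a user column called `x`. Whether that alone removes the `AnalysisException` shown above is an assumption the thread does not confirm.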
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213528826 @felixcheung @shivaram I'm done with all your suggestions. Thanks. Shall we merge?
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213528609 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213528509 **[Test build #56711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56711/consoleFull)** for PR 11569 at commit [`fc4c536`](https://github.com/apache/spark/commit/fc4c536ca55fe4beefc27139dad03093cff7194e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213528612 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56711/ Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213523010 **[Test build #56711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56711/consoleFull)** for PR 11569 at commit [`fc4c536`](https://github.com/apache/spark/commit/fc4c536ca55fe4beefc27139dad03093cff7194e).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213188379 Looks good except for one minor doc comment.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60675031 --- Diff: R/pkg/R/DataFrame.R --- @@ -2465,6 +2465,110 @@ setMethod("drop", base::drop(x) }) +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs --- End diff -- `#' @family DataFrame functions`
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213124256 @felixcheung @shivaram I have addressed all your comments. Thanks!
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213117725 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213117603 **[Test build #56589 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56589/consoleFull)** for PR 11569 at commit [`3e19fe8`](https://github.com/apache/spark/commit/3e19fe889c2c9709783503cd1082f4ab7ba6d37c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213117728 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56589/ Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213113387 **[Test build #56589 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56589/consoleFull)** for PR 11569 at commit [`3e19fe8`](https://github.com/apache/spark/commit/3e19fe889c2c9709783503cd1082f4ab7ba6d37c).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213113112 **[Test build #56588 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56588/consoleFull)** for PR 11569 at commit [`c2c4601`](https://github.com/apache/spark/commit/c2c4601b1a09e23e5b1be64fb027b92f9638da20). * This patch **fails R style tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213113120 Build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213113123 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56588/ Test FAILed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213111260 **[Test build #56588 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56588/consoleFull)** for PR 11569 at commit [`c2c4601`](https://github.com/apache/spark/commit/c2c4601b1a09e23e5b1be64fb027b92f9638da20).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60651721 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,107 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column and remove all other columns + df <- na.omit(df[, 
colname]) + + # TODO: This will be when improved SPARK-9325 or SPARK-13436 are fixed + getColumn(df, colname) +} else if (class(col) == "Column") { + # Append the given column to the dataset. This is to support Columns that + # don't belong to the DataFrame but are rather expressions + df$x <- col + + # Filter NA values in the target column. Cannot remove all other columns + # since given Column may be an expression on one or more existing columns + df <- na.omit(df) + + colname <- "x" + col +} + +# At this point, df only has one column: the one to compute the histogram from +stats <- collect(describe(df[, colname])) +min <- as.numeric(stats[4, 2]) +max <- as.numeric(stats[5, 2]) + +# Normalize the data +xnorm <- (x - min) / (max - min) + +# Round the data to 4 significant digits. This is to avoid rounding issues. +xnorm <- cast(xnorm * 1, "integer") / 1.0 --- End diff -- Yeah, let me move it to DataFrame.R
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60651656 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,107 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column and remove all other columns + df <- na.omit(df[, 
colname]) + + # TODO: This will be when improved SPARK-9325 or SPARK-13436 are fixed + getColumn(df, colname) +} else if (class(col) == "Column") { + # Append the given column to the dataset. This is to support Columns that + # don't belong to the DataFrame but are rather expressions + df$x <- col + + # Filter NA values in the target column. Cannot remove all other columns + # since given Column may be an expression on one or more existing columns + df <- na.omit(df) + + colname <- "x" + col +} + +# At this point, df only has one column: the one to compute the histogram from +stats <- collect(describe(df[, colname])) +min <- as.numeric(stats[4, 2]) +max <- as.numeric(stats[5, 2]) + +# Normalize the data +xnorm <- (x - min) / (max - min) + +# Round the data to 4 significant digits. This is to avoid rounding issues. +xnorm <- cast(xnorm * 1, "integer") / 1.0 --- End diff -- @felixcheung That would never be the case since I'm normalizing the data to be within [0, 1]. Line 2715: ` xnorm <- (x - min) / (max - min)`
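The normalization argument in this reply can be checked locally on a plain numeric vector. This is a minimal sketch in base R, not SparkR code: `normalize` and `truncateDigits` are hypothetical names, and the scale of 10^4 is an assumption taken from the "4 significant digits" code comment (the quoted diff itself shows `xnorm * 1` and `/ 1.0`).

```r
# Min-max normalization as in the quoted diff: maps the non-NA values of x
# onto the interval [0, 1].
normalize <- function(x) {
  x <- x[!is.na(x)]
  (x - min(x)) / (max(x) - min(x))
}

# Truncate to a fixed number of decimal digits by scaling, dropping the
# fractional part, and scaling back. `digits = 4` is an assumption.
truncateDigits <- function(v, digits = 4) {
  scale <- 10^digits
  trunc(v * scale) / scale
}

x <- c(2, 4, NA, 6, 10)
xnorm <- truncateDigits(normalize(x))
print(xnorm)
print(range(xnorm))
```

Because every normalized value lies in [0, 1], the scaled value stays between 0 and the scale factor. Note that the sketch does not handle a constant column, where `max(x) - min(x)` is zero; the thread does not discuss that case.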
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60452619 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,100 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column + df <- na.omit(df[, colname]) + + # TODO: This will be 
when improved SPARK-9325 or SPARK-13436 are fixed + eval(parse(text = paste0("df$", colname))) +} else if (class(col) == "Column") { + # Append the given column to the dataset + df$x <- col + colname <- "x" + col +} + +stats <- collect(describe(df[, colname])) +min <- as.numeric(stats[4, 2]) +max <- as.numeric(stats[5, 2]) + +# Normalize the data +xnorm <- (x - min) / (max - min) + +# Round the data to 4 significant digits. This is to avoid rounding issues. +xnorm <- cast(xnorm * 1, "integer") / 1.0 + +# Since min = 0, max = 1 (data is already normalized) +normBinSize <- 1 / nbins +binsize <- (max - min) / nbins +approxBins <- xnorm / normBinSize + +# Adjust values that are equal to the upper bound of each bin +bins <- cast(approxBins - + ifelse(approxBins == cast(approxBins, "integer") & x != min, 1, 0), + "integer") + +df$bins <- bins --- End diff -- @felixcheung I need to remove NA values from Column `x` too, since `x` could be an arbitrary Column expression. Therefore, the `na.omit()` invocation should go afterwards.
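The bin assignment in the quoted diff can be tried out on a local numeric vector. The sketch below mirrors the arithmetic of the diff in base R, under stated assumptions: `binIndex` is a hypothetical name, `floor` stands in for the integer cast (equivalent here because the values are non-negative), and NA values are dropped up front, which is the ordering this reply argues for.

```r
# Assign each non-NA value of x to a zero-based bin index in 0..(nbins - 1),
# following the arithmetic of the quoted diff.
binIndex <- function(x, nbins = 10) {
  x <- x[!is.na(x)]
  lo <- min(x)
  hi <- max(x)

  # Normalize to [0, 1], then express each value in units of one bin width
  xnorm <- (x - lo) / (hi - lo)
  normBinSize <- 1 / nbins
  approxBins <- xnorm / normBinSize

  # A value sitting exactly on the upper bound of a bin belongs to that bin,
  # so its index is shifted down by one; the minimum stays in bin 0.
  onUpperEdge <- approxBins == floor(approxBins) & x != lo
  as.integer(approxBins - ifelse(onUpperEdge, 1, 0))
}

x <- c(0, 1, NA, 2, 3, 4)
bins <- binIndex(x, nbins = 4)
print(bins)

# Per-bin counts; tabulate() expects one-based positive integers
counts <- tabulate(bins + 1L, nbins = 4)
print(counts)
```

For this input the indices come out as 0, 0, 1, 2, 3 and the counts as 2, 1, 1, 1: the value 1 sits on the upper bound of the first bin and is kept there. The input was chosen so that all intermediate values are exact in floating point; with arbitrary data, edge detection by equality is sensitive to rounding, which is presumably what the rounding step in the diff is meant to mitigate.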
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60319306 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,107 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column and remove all other columns + df <- na.omit(df[, 
colname])
+
+ # TODO: This will be when improved SPARK-9325 or SPARK-13436 are fixed
+ getColumn(df, colname)
+} else if (class(col) == "Column") {
+ # Append the given column to the dataset. This is to support Columns that
+ # don't belong to the DataFrame but are rather expressions
+ df$x <- col
+
+ # Filter NA values in the target column. Cannot remove all other columns
+ # since given Column may be an expression on one or more existing columns
+ df <- na.omit(df)
+
+ colname <- "x"
+ col
+}
+
+# At this point, df only has one column: the one to compute the histogram from
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
--- End diff --
Would this truncate the value if xnorm was close to 2*10^9 before * 1?
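For context on the question, here is a short base-R sketch of the 32-bit integer limit. It uses R's own `as.integer()` only; how Spark's `cast(..., "integer")` treats out-of-range values is not shown here and would need to be checked separately.

```r
# Largest value an R (32-bit signed) integer can hold: 2147483647,
# i.e. roughly 2.1 * 10^9.
int_max <- .Machine$integer.max

# A double inside the range converts, with the fraction truncated.
inside <- as.integer(2e9 + 0.7)

# A double outside the range becomes NA in R (a warning is also raised).
outside <- suppressWarnings(as.integer(3e9))
```

In the quoted diff, `xnorm` is computed as `(x - min) / (max - min)`, so any value near this limit would have to come from the multiplier applied before the cast.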
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60318933 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,100 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column + df <- na.omit(df[, colname]) + + # TODO: This will be 
when improved SPARK-9325 or SPARK-13436 are fixed
+ eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+ # Append the given column to the dataset
+ df$x <- col
+ colname <- "x"
+ col
+}
+
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
+
+# Since min = 0, max = 1 (data is already normalized)
+normBinSize <- 1 / nbins
+binsize <- (max - min) / nbins
+approxBins <- xnorm / normBinSize
+
+# Adjust values that are equal to the upper bound of each bin
+bins <- cast(approxBins -
+ ifelse(approxBins == cast(approxBins, "integer") & x != min, 1, 0),
+ "integer")
+
+df$bins <- bins
--- End diff --
Perhaps swap these two lines?
```
+ df$x <- col
+ df <- na.omit(df)
```
to
```
+ df <- na.omit(df)
+ df$x <- col
```
Would that make it clearer?
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212145097 Thanks. As I commented [earlier](https://github.com/apache/spark/pull/11569#issuecomment-200947392), I suspect this belongs in DataFrame.R instead, since it is currently a function on DataFrame. I think that would be more discoverable/maintainable. When #11336 is resolved, we could merge/update this to work by Column only.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212141260 @shivaram @felixcheung I have addressed all your comments. Thank you!
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212139757 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56269/ Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212139755 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212139610
**[Test build #56269 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56269/consoleFull)** for PR 11569 at commit [`b03c335`](https://github.com/apache/spark/commit/b03c335a6dc9818f54fc2633fb149f9f3ad0277d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212135413 **[Test build #56269 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56269/consoleFull)** for PR 11569 at commit [`b03c335`](https://github.com/apache/spark/commit/b03c335a6dc9818f54fc2633fb149f9f3ad0277d).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212133278
**[Test build #56267 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56267/consoleFull)** for PR 11569 at commit [`046b7da`](https://github.com/apache/spark/commit/046b7dad841bbf13d1d4b93bf001474f74b25865).
* This patch **fails R style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212133286 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212133290 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56267/ Test FAILed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212132210 **[Test build #56267 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56267/consoleFull)** for PR 11569 at commit [`046b7da`](https://github.com/apache/spark/commit/046b7dad841bbf13d1d4b93bf001474f74b25865).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60287324 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,100 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column + df <- na.omit(df[, colname]) + + # TODO: This will be 
when improved SPARK-9325 or SPARK-13436 are fixed
+ eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+ # Append the given column to the dataset
+ df$x <- col
+ colname <- "x"
+ col
+}
+
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
+
+# Since min = 0, max = 1 (data is already normalized)
+normBinSize <- 1 / nbins
+binsize <- (max - min) / nbins
+approxBins <- xnorm / normBinSize
+
+# Adjust values that are equal to the upper bound of each bin
+bins <- cast(approxBins -
+ ifelse(approxBins == cast(approxBins, "integer") & x != min, 1, 0),
+ "integer")
+
+df$bins <- bins
--- End diff --
@shivaram @felixcheung The original DataFrame is NOT being mutated. As a matter of fact, R doesn't support passing parameters by reference, so effectively, a new copy of the DataFrame is being created every time this function is being invoked. This should, in turn, trigger the creation of a new corresponding Java object. To illustrate this, notice the dataset doesn't change after running histogram():
```
> str(irisDF)
'DataFrame': 5 variables:
 $ Sepal_Length: num 5.1 4.9 4.7 4.6 5 5.4
 $ Sepal_Width : num 3.5 3 3.2 3.1 3.6 3.9
 $ Petal_Length: num 1.4 1.4 1.3 1.5 1.4 1.7
 $ Petal_Width : num 0.2 0.2 0.2 0.2 0.2 0.4
 $ Species     : chr "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"
> histogram(irisDF, irisDF$Sepal_Length + 1)
   bins counts centroids
1     0      9      5.48
2     1     23      5.84
3     2     14      6.20
4     3     27      6.56
5     4     22      6.92
6     5     20      7.28
7     6     18      7.64
8     7      6      8.00
9     8      5      8.36
10    9      6      8.72
> str(irisDF)
```
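The copy-on-modify behavior described in this comment can be seen with a plain R object. This is a base-R sketch with a hypothetical helper `add_bins()`; it does not involve SparkR or the JVM-side object, and only shows that assigning to an argument inside a function leaves the caller's binding unchanged.

```r
# Hypothetical helper: adds a column to its argument, much like `df$bins <- bins`.
add_bins <- function(df) {
  df$bins <- seq_len(nrow(df)) - 1L  # modifies the function-local copy only
  df
}

local_df <- data.frame(value = c(5.1, 4.9, 4.7))
result <- add_bins(local_df)

caller_names <- names(local_df)  # the caller's object keeps its single column
result_names <- names(result)    # the returned copy carries the extra column
```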
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212057016 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212057021 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56245/ Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212056674
**[Test build #56245 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56245/consoleFull)** for PR 11569 at commit [`adc3446`](https://github.com/apache/spark/commit/adc34461a869d4b4c072952b999c896047e994d6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60284165 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,100 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column + df <- na.omit(df[, colname]) --- End diff -- Yes. 
Let me fix that.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212051119 **[Test build #56245 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56245/consoleFull)** for PR 11569 at commit [`adc3446`](https://github.com/apache/spark/commit/adc34461a869d4b4c072952b999c896047e994d6).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60281798 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,100 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column + df <- na.omit(df[, colname]) + + # TODO: This will be 
when improved SPARK-9325 or SPARK-13436 are fixed
+ eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+ # Append the given column to the dataset
+ df$x <- col
+ colname <- "x"
+ col
+}
+
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
+
+# Since min = 0, max = 1 (data is already normalized)
+normBinSize <- 1 / nbins
+binsize <- (max - min) / nbins
+approxBins <- xnorm / normBinSize
+
+# Adjust values that are equal to the upper bound of each bin
+bins <- cast(approxBins -
+ ifelse(approxBins == cast(approxBins, "integer") & x != min, 1, 0),
+ "integer")
+
+df$bins <- bins
--- End diff --
Perhaps create a new intermediate DataFrame instead of mutating the input DataFrame? The return value of this function is a local/native R data.frame anyway.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212049730 @felixcheung Could you also take a look at this PR?
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60281327 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,100 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column + df <- na.omit(df[, colname]) + + # TODO: This will be 
when improved SPARK-9325 or SPARK-13436 are fixed
+ eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+ # Append the given column to the dataset
+ df$x <- col
+ colname <- "x"
+ col
+}
+
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
+
+# Since min = 0, max = 1 (data is already normalized)
+normBinSize <- 1 / nbins
+binsize <- (max - min) / nbins
+approxBins <- xnorm / normBinSize
+
+# Adjust values that are equal to the upper bound of each bin
+bins <- cast(approxBins -
+ ifelse(approxBins == cast(approxBins, "integer") & x != min, 1, 0),
+ "integer")
+
+df$bins <- bins
--- End diff --
Similar question as above. I'm wondering if there is a better way than adding `bins` as a column to the input DF. Ideally, as a user, I would assume that `histogram` is a safe function in that it doesn't mutate the input data given to it. I am not sure what's an easy solution here, though.
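To make the `bins` computation under discussion easier to follow, here is a base-R sketch of the same arithmetic on a local numeric vector. It is an illustration only: `bin_index()` is a hypothetical name, `as.integer()` stands in for the Column `cast(..., "integer")`, the rounding step from the diff is left out, and nothing here touches a SparkR DataFrame.

```r
# Hypothetical local re-statement of the binning arithmetic in the quoted diff.
bin_index <- function(x, nbins = 10) {
  if (nbins < 2) {
    stop("The number of bins must be a positive integer number greater than 1.")
  }
  nbins <- floor(nbins)
  x <- x[!is.na(x)]

  lo <- min(x)
  hi <- max(x)

  # Normalize to [0, 1], then express each value in units of one bin width.
  xnorm <- (x - lo) / (hi - lo)
  norm_bin_size <- 1 / nbins
  approx_bins <- xnorm / norm_bin_size

  # Values sitting exactly on a bin's upper bound belong to the bin below,
  # except the minimum, which stays in bin 0.
  on_upper_bound <- approx_bins == as.integer(approx_bins) & x != lo
  as.integer(approx_bins - ifelse(on_upper_bound, 1, 0))
}

two_bins <- bin_index(c(0, 5, 10), nbins = 2)
four_bins <- bin_index(0:8, nbins = 4)
```

With this adjustment the maximum value lands in bin `nbins - 1` rather than in a separate bin of its own, so bin indices run from 0 to `nbins - 1`.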
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60280663 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,100 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column + df <- na.omit(df[, colname]) + + # TODO: This will be 
improved when SPARK-9325 or SPARK-13436 are fixed + eval(parse(text = paste0("df$", colname))) +} else if (class(col) == "Column") { + # Append the given column to the dataset + df$x <- col --- End diff -- In that case, can we check whether this column is already present in the dataframe and not add it if it is? I just don't want to keep adding spurious columns if somebody keeps calling `histogram(df$average)` repeatedly.
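Editor's note: whether repeated calls accumulate columns depends on whether the assignment inside the function is visible to the caller. A minimal base R sketch, using a plain `data.frame` and a hypothetical helper `with_tmp_column`, shows R's usual copy-on-modify behaviour. That a SparkR DataFrame behaves the same way under `df$x <- col` is an assumption the thread does not confirm.

```r
# Sketch only: plain base R data.frame, not a SparkR DataFrame.
# with_tmp_column is a hypothetical helper, not part of the PR.
with_tmp_column <- function(df, values) {
  df$x <- values   # rebinds the function's local copy of df
  names(df)        # column names as seen inside the function
}

d <- data.frame(a = 1:3)
inner_names <- with_tmp_column(d, d$a + 1)  # names seen inside the function
outer_names <- names(d)                     # names of the caller's data.frame
```

Here `inner_names` contains both `a` and `x`, while `outer_names` is still just `a`, so in base R the temporary column never reaches the caller.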
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60278988 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,100 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column + df <- na.omit(df[, colname]) + + # TODO: This will be 
improved when SPARK-9325 or SPARK-13436 are fixed + eval(parse(text = paste0("df$", colname))) +} else if (class(col) == "Column") { + # Append the given column to the dataset + df$x <- col --- End diff -- @shivaram That's because the user could do: `histogram(irisDF, irisDF$Sepal_Length + 1, nbins=12)`. In that case, the given Column doesn't belong to the DataFrame. This gives the user a lot of flexibility and an R-like feel.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60278119 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,100 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column + df <- na.omit(df[, colname]) --- End diff -- If we are 
filtering NA values, should we also do it for the other case?
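Editor's note: one way to treat both input kinds consistently is to resolve the column first and filter NA values once, after the branch. The following is a minimal base R sketch with a plain `data.frame`. The helper name `resolve_values` is hypothetical, and a numeric vector stands in for a SparkR Column.

```r
# Sketch only: base R stand-in for the two branches in the quoted diff.
# resolve_values is a hypothetical helper, not part of the PR.
resolve_values <- function(df, col) {
  if (is.null(col)) {
    stop("col must be specified.")
  }
  values <- if (is.character(col) && length(col) == 1) {
    # Branch 1: col is a column name.
    if (!col %in% names(df)) {
      stop("Specified colname does not belong to the given DataFrame.")
    }
    df[[col]]
  } else {
    # Branch 2: col is already a vector of values (stand-in for a Column).
    col
  }
  # NA filtering happens once, so both branches are treated the same way.
  values[!is.na(values)]
}

d <- data.frame(a = c(1, NA, 3))
by_name <- resolve_values(d, "a")
by_expr <- resolve_values(d, d$a + 1)
```

Both calls drop the missing entry: `by_name` keeps the values 1 and 3, and `by_expr` keeps 2 and 4.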
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60278057 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,100 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column + df <- na.omit(df[, colname]) + + # TODO: This will be 
improved when SPARK-9325 or SPARK-13436 are fixed + eval(parse(text = paste0("df$", colname))) +} else if (class(col) == "Column") { + # Append the given column to the dataset + df$x <- col --- End diff -- I'm not sure why we need to add the column to the dataframe. Isn't it already a part of the dataframe?
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60277269 --- Diff: R/pkg/DESCRIPTION --- @@ -36,3 +36,4 @@ Collate: 'stats.R' 'types.R' 'utils.R' +RoxygenNote: 5.0.1 --- End diff -- I think this might have been auto-generated by roxygen. Can we revert this file for this PR?
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-209030552 @felixcheung @shivaram This is done. Shall we merge?
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-203678252 Looks good to you @felixcheung?
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-201607130 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-201607135 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54237/ Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-201606853 **[Test build #54237 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54237/consoleFull)** for PR 11569 at commit [`2800492`](https://github.com/apache/spark/commit/2800492e307253d7f6004944b2e4beb11f76c330). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-201597000 @felixcheung I just added support for Columns or characters. In my opinion, histogram() is a column function, and once we sort out #11336, it won't have to take a DataFrame as a parameter.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-201596039 **[Test build #54237 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54237/consoleFull)** for PR 11569 at commit [`2800492`](https://github.com/apache/spark/commit/2800492e307253d7f6004944b2e4beb11f76c330).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-201594759 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54234/ Test FAILed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-201594744 **[Test build #54234 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54234/consoleFull)** for PR 11569 at commit [`19f995c`](https://github.com/apache/spark/commit/19f995c8f72efb58b818f81a122529680c24f5ec). * This patch **fails R style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-201594755 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-201593441 **[Test build #54234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54234/consoleFull)** for PR 11569 at commit [`19f995c`](https://github.com/apache/spark/commit/19f995c8f72efb58b818f81a122529680c24f5ec).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-200947392 By the way, is this still the right place? functions.R works with Column; if this works with DataFrame, should it go to DataFrame.R?
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-200928627 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54051/ Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-200928622 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-200928468 **[Test build #54051 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54051/consoleFull)** for PR 11569 at commit [`dbc9d75`](https://github.com/apache/spark/commit/dbc9d75584b7ae3ab952c7a88b19b21ad00a5a82). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r57349827 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,81 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). The default is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, it would be very easy to +#' # render the histogram using R's visualization packages such as ggplot2. +#' +#' } +setMethod("histogram", + signature(df = "DataFrame"), + function(df, colname, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") --- End diff -- Done!
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-200919208 **[Test build #54051 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54051/consoleFull)** for PR 11569 at commit [`dbc9d75`](https://github.com/apache/spark/commit/dbc9d75584b7ae3ab952c7a88b19b21ad00a5a82).