[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 @shivaram: Have you reviewed this? If the intent is to merge it, I'll gladly update the code. @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 Happy New Year, folks! Any updates on this? @shivaram @falaki --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 @falaki @shivaram Shall we merge this?
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 @shivaram @felixcheung @falaki Any thoughts?
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 Folks?
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 @shivaram @falaki @felixcheung Any additional comments? Otherwise, are we ready to merge?
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 @felixcheung I'm done! Thanks for your comments!
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 @felixcheung When a user types a variable name in the R shell, it triggers the method showDefault(), which in turn invokes show(). I wrote an implementation of show() for Column that, in turn, invokes head() (not collect()), showing the first 20 elements of the dataset. This mimics R behavior and I think it also helps with usability. However, if the agreement is not to have that, I can just remove the show method.
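A minimal sketch of the dispatch chain described in this comment (the show()-delegates-to-head() wiring is from the PR description; the exact method body below is illustrative, not the PR's actual code):

```r
# Illustrative sketch: auto-printing a Column in the R shell dispatches to
# show(), which here delegates to head() so only the first 20 elements are
# fetched and printed, rather than collecting the entire dataset.
setMethod("show", signature = "Column", definition = function(object) {
  print(head(object, num = 20))
})
```

With this in place, typing a Column-valued variable at the prompt prints a short preview, matching how base R auto-prints vectors and data.frames.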
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 @felixcheung @falaki Folks?
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 @felixcheung @falaki I have addressed all your comments. Shall we merge?
[GitHub] spark pull request #11336: [SPARK-9325][SPARK-R] head() and show() for Colum...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11336#discussion_r84150590

--- Diff: R/pkg/R/DataFrame.R ---
@@ -3321,3 +3328,11 @@ setMethod("randomSplit",
     }
     sapply(sdfs, dataFrame)
   })
+
+# A global singleton for an empty SparkR DataFrame.
+getEmptySparkRDataFrame <- function() {
+  if (is.null(.sparkREnv$EMPTY_DF)) {
+    .sparkREnv$EMPTY_DF <- as.DataFrame(data.frame(0))
+  }
+  return(.sparkREnv$EMPTY_DF)
--- End diff --

get() would throw an error if the variable is not defined. I'll use exists()
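A sketch of the exists()-based variant discussed in this comment (.sparkREnv, as.DataFrame, and the EMPTY_DF name come from the quoted diff; the guard logic is the point, not an exact reproduction of the final code):

```r
# Lazily initialized singleton in a package environment. exists() is safe
# when the variable has never been assigned, whereas get() on an undefined
# name throws an error.
getEmptySparkRDataFrame <- function() {
  if (!exists("EMPTY_DF", envir = .sparkREnv)) {
    assign("EMPTY_DF", as.DataFrame(data.frame(0)), envir = .sparkREnv)
  }
  get("EMPTY_DF", envir = .sparkREnv)
}
```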
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 @felixcheung @falaki I have addressed all your comments and tests pass now. Thank you! cc @aloknsingh
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 @falaki @felixcheung I have addressed all your comments. I'm getting two documentation warnings which seem to be making the build fail:

1)
```
Undocumented S4 methods:
  generic 'head' and siglist 'Column'
  generic 'show' and siglist 'Column'
```
The documentation for these is in DataFrame.R. I don't see a need to duplicate the docs in column.R.

2)
```
Undocumented arguments in documentation object 'Column-class' 'df'
Undocumented arguments in documentation object 'head' '...'
```
I do have documentation for slot df in class Column. Also, I don't have ... as part of the signature of method head, so I'm not sure why this warning comes up.
[GitHub] spark pull request #11336: [SPARK-9325][SPARK-R] collect() head() and show()...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11336#discussion_r83057347

--- Diff: R/pkg/R/functions.R ---
@@ -2836,7 +2845,11 @@ setMethod("lpad", signature(x = "Column", len = "numeric", pad = "character"),
 setMethod("rand", signature(seed = "missing"),
   function(seed) {
     jc <- callJStatic("org.apache.spark.sql.functions", "rand")
-    column(jc)
+
+    # By assigning a one-row data.frame, the result of this function can be collected
+    # returning a one-element Column
+    df <- as.DataFrame(sparkRSQL.init(), data.frame(0))
--- End diff --

@felixcheung That's a good idea. I have created a singleton accordingly.
[GitHub] spark pull request #11336: [SPARK-9325][SPARK-R] collect() head() and show()...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11336#discussion_r82919122

--- Diff: R/pkg/R/DataFrame.R ---
@@ -1168,12 +1179,14 @@ setMethod("take",
 #' Head
 #'
-#' Return the first \code{num} rows of a SparkDataFrame as a R data.frame. If \code{num} is not
-#' specified, then head() returns the first 6 rows as with R data.frame.
+#' Return the first elements of a dataset. If \code{x} is a SparkDataFrame, its first
+#' rows will be returned as a data.frame. If the dataset is a \code{Column}, its first
+#' elements will be returned as a vector. The number of elements to be returned
+#' is given by parameter \code{num}. Default value for \code{num} is 6.
 #'
-#' @param x a SparkDataFrame.
-#' @param num the number of rows to return. Default is 6.
-#' @return A data.frame.
+#' @param x A SparkDataFrame or Column
--- End diff --

Not sure I follow here. Could you point to the specific example?
[GitHub] spark pull request #11336: [SPARK-9325][SPARK-R] collect() head() and show()...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11336#discussion_r82918200

--- Diff: R/pkg/R/column.R ---
@@ -32,35 +34,57 @@ setOldClass("jobj")
 #' @export
 #' @note Column since 1.4.0
 setClass("Column",
-         slots = list(jc = "jobj"))
+         slots = list(jc = "jobj", df = "SparkDataFrameOrNull"))

 #' A set of operations working with SparkDataFrame columns
 #' @rdname columnfunctions
 #' @name columnfunctions
 NULL

-setMethod("initialize", "Column", function(.Object, jc) {
+setMethod("initialize", "Column", function(.Object, jc, df) {
   .Object@jc <- jc
+
+  # Some Column objects don't have any referencing DataFrame. In such case, df will be NULL.
+  if (missing(df)) {
+    df <- NULL
+  }
+  .Object@df <- df
   .Object
 })

+setMethod("show", signature = "Column", definition = function(object) {
--- End diff --

Sure
[GitHub] spark pull request #11336: [SPARK-9325][SPARK-R] collect() head() and show()...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11336#discussion_r82911261

--- Diff: R/pkg/R/DataFrame.R ---
@@ -1049,11 +1055,16 @@ setMethod("dim",
 #' @export
 #' @examples
 #'\dontrun{
-#' sparkR.session()
-#' path <- "path/to/file.json"
-#' df <- read.json(path)
-#' collected <- collect(df)
-#' firstName <- collected[[1]]$name
+#' # Initialize Spark context and SQL context
+#' sc <- sparkR.init()
+#' sqlContext <- sparkRSQL.init(sc)
--- End diff --

Sure. Thanks!
[GitHub] spark pull request #11336: [SPARK-9325][SPARK-R] collect() head() and show()...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11336#discussion_r82911244

--- Diff: R/pkg/R/DataFrame.R ---
@@ -1035,10 +1035,16 @@ setMethod("dim",
     c(count(x), ncol(x))
   })

-#' Collects all the elements of a SparkDataFrame and coerces them into an R data.frame.
+#' Download Spark datasets into R
--- End diff --

Sure. Thanks!
[GitHub] spark pull request #11336: [SPARK-9325][SPARK-R] collect() head() and show()...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11336#discussion_r82910308

--- Diff: R/pkg/R/functions.R ---
@@ -2836,7 +2845,11 @@ setMethod("lpad", signature(x = "Column", len = "numeric", pad = "character"),
 setMethod("rand", signature(seed = "missing"),
   function(seed) {
     jc <- callJStatic("org.apache.spark.sql.functions", "rand")
-    column(jc)
+
+    # By assigning a one-row data.frame, the result of this function can be collected
+    # returning a one-element Column
+    df <- as.DataFrame(sparkRSQL.init(), data.frame(0))
--- End diff --

See my comment from March 30 to illustrate why this is needed. I'll change sparkRSQL.init() to sparkR.session(). Thanks for catching this!
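For illustration, the before/after of the deprecation fix mentioned here might look like the following (a sketch assuming Spark 2.x, where sparkR.session() establishes a default session that as.DataFrame() picks up implicitly):

```r
# Deprecated pattern: passing an explicit SQL context created via sparkRSQL.init()
# df <- as.DataFrame(sparkRSQL.init(), data.frame(0))

# Preferred pattern: establish a session once, then call as.DataFrame() without a context
sparkR.session()
df <- as.DataFrame(data.frame(0))
```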
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 @falaki Yeah, but those warnings are making the build fail (see below). Is that okay? Now I see a new "Checks" section. I may be outdated on the protocols, as it's been a while since I last committed :-). Thanks!
```
Failed
- 1. Error: column binary mathfunctions (@test_sparkSQL.R#1256) --
error in evaluating the argument 'x' in selecting a method for function 'collect':
error in evaluating the argument 'col' in selecting a method for function 'select':
(converted from warning) 'sparkRSQL.init' is deprecated. Use 'sparkR.session' instead.
See help("Deprecated")
1: expect_equal(class(collect(select(df, rand()))[2, 1]), "numeric") at /home/jenkins/workspace/SparkPullRequestBuilder/R/lib/SparkR/tests/testthat/test_sparkSQL.R:1256
2: compare(object, expected, ...)
3: collect(select(df, rand()))
```
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 @falaki Thanks for your comments. Yeah, before removing collect/show, I just wanted to rebase onto the current upstream master. I'm getting a build error which is actually a warning, not even an R error:
```
(converted from warning) 'sparkRSQL.init' is deprecated. Use 'sparkR.session' instead.
```
I don't explicitly use sparkRSQL.init anywhere in my code, so I'm investigating. If you have any suggestions, that would be great. Thanks!
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 @falaki Sorry, I was out of town. Let me get back to this today. Thank you!
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...
Github user olarayej commented on the issue: https://github.com/apache/spark/pull/11336 @falaki Absolutely. Let me do that. Thank you!
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11336#issuecomment-217586684 I just got this branch up to date. Any comments, folks? @shivaram @falaki @felixcheung @rxin @sun-rui @mengxr
[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11318#issuecomment-215253259 @shivaram I've changed default value to drop=F as you suggested. Thanks!
[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11318#issuecomment-215181464 @shivaram @sun-rui @felixcheung This one's ready. Shall we merge?
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214898243 @shivaram @felixcheung I have addressed all your comments. Anything else? Or shall we merge? Thanks!
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r61168496

--- Diff: R/pkg/R/DataFrame.R ---
@@ -2465,6 +2465,110 @@ setMethod("drop",
     base::drop(x)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#'
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family DataFrame functions
+#' @export
+#' @examples
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#'
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
+#' }
+setMethod("histogram",
+          signature(df = "DataFrame", col = "characterOrColumn"),
+          function(df, col, nbins = 10) {
+            # Validate nbins
+            if (nbins < 2) {
+              stop("The number of bins must be a positive integer number greater than 1.")
+            }
+
+            # Round nbins to the smallest integer
+            nbins <- floor(nbins)
+
+            # Validate col
+            if (is.null(col)) {
+              stop("col must be specified.")
+            }
+
+            colname <- col
+            x <- if (class(col) == "character") {
+              if (!colname %in% names(df)) {
+                stop("Specified colname does not belong to the given DataFrame.")
+              }
+
+              # Filter NA values in the target column and remove all other columns
+              df <- na.omit(df[, colname])
+              getColumn(df, colname)
+
+            } else if (class(col) == "Column") {
+
+              # Append the given column to the dataset. This is to support Columns that
+              # don't belong to the DataFrame but are rather expressions
+              df$x <- col
--- End diff --

@shivaram Yes, I have fixed this. Thanks!
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-214890244 @shivaram @felixcheung Looks like the version of lint-r running on the build server is different from the one on Spark's GitHub. Even though lint-r passes locally, I keep getting errors like this:
```
R/DataFrame.R:2542:40: style: Put spaces around all infix operators.
collapse=""
```
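For reference, the flagged style issue and its fix look like this (lintr treats the `=` in a named argument as an infix operator here; `parts` is just a placeholder vector for illustration):

```r
parts <- c("a", "b", "c")
# paste(parts, collapse="")    # flagged: no spaces around `=`
paste(parts, collapse = "")    # passes lint-r
```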
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r61009021

--- Diff: R/pkg/R/DataFrame.R ---
@@ -2469,6 +2469,110 @@ setMethod("drop",
     base::drop(x)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#'
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family DataFrame functions
+#' @export
+#' @examples
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#'
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
+#' }
+setMethod("histogram",
+          signature(df = "DataFrame", col = "characterOrColumn"),
--- End diff --

Yeah, let me deliver that with Shivaram's fix as well. Thanks!
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213528826 @felixcheung @shivaram I'm done with all your suggestions. Thanks. Shall we merge?
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-213124256 @felixcheung @shivaram I have addressed all your comments. Thanks!
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60651721 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,107 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column and remove all other columns + df <- na.omit(df[, 
colname]) + + # TODO: This will be when improved SPARK-9325 or SPARK-13436 are fixed + getColumn(df, colname) +} else if (class(col) == "Column") { + # Append the given column to the dataset. This is to support Columns that + # don't belong to the DataFrame but are rather expressions + df$x <- col + + # Filter NA values in the target column. Cannot remove all other columns + # since given Column may be an expression on one or more existing columns + df <- na.omit(df) + + colname <- "x" + col +} + +# At this point, df only has one column: the one to compute the histogram from +stats <- collect(describe(df[, colname])) +min <- as.numeric(stats[4, 2]) +max <- as.numeric(stats[5, 2]) + +# Normalize the data +xnorm <- (x - min) / (max - min) + +# Round the data to 4 significant digits. This is to avoid rounding issues. +xnorm <- cast(xnorm * 1, "integer") / 1.0 --- End diff -- Yeah, let me move it to DataFrame.R --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
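The binning scheme in the diff above — normalize values to [0, 1], divide by the per-bin width, then pull values that land exactly on a bin's upper bound down into the bin below (keeping the minimum in bin 0) — can be sketched outside Spark. This is an illustrative Python translation of that logic, not code from the PR:

```python
def histogram_bins(values, nbins=10):
    """Assign each value to one of nbins equal-width bins, mirroring the
    normalize-then-floor scheme in the quoted SparkR code."""
    if nbins < 2:
        raise ValueError("The number of bins must be an integer greater than 1.")
    lo, hi = min(values), max(values)
    counts = [0] * nbins
    for x in values:
        xnorm = (x - lo) / (hi - lo)   # normalize to [0, 1]
        b = xnorm / (1.0 / nbins)      # approximate bin index
        # Values sitting exactly on a bin's upper bound fall into the
        # previous bin, except the minimum, which stays in bin 0.
        if b == int(b) and x != lo:
            b -= 1
        counts[int(b)] += 1
    return counts

print(histogram_bins(list(range(11)), nbins=2))  # → [6, 5]
```

With 11 evenly spaced values and 2 bins, the boundary value (5) falls into the lower bin, matching the `approxBins == cast(approxBins, "integer") & x != min` adjustment in the R code.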
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60651656 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,107 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column and remove all other columns + df <- na.omit(df[, 
colname]) + + # TODO: This will be when improved SPARK-9325 or SPARK-13436 are fixed + getColumn(df, colname) +} else if (class(col) == "Column") { + # Append the given column to the dataset. This is to support Columns that + # don't belong to the DataFrame but are rather expressions + df$x <- col + + # Filter NA values in the target column. Cannot remove all other columns + # since given Column may be an expression on one or more existing columns + df <- na.omit(df) + + colname <- "x" + col +} + +# At this point, df only has one column: the one to compute the histogram from +stats <- collect(describe(df[, colname])) +min <- as.numeric(stats[4, 2]) +max <- as.numeric(stats[5, 2]) + +# Normalize the data +xnorm <- (x - min) / (max - min) + +# Round the data to 4 significant digits. This is to avoid rounding issues. +xnorm <- cast(xnorm * 1, "integer") / 1.0 --- End diff -- @felixcheung That would never be the case since I'm normalizing the data to be within [0, 1]. Line 2715: ` xnorm <- (x - min) / (max - min)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60452619 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,100 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column + df <- na.omit(df[, colname]) + + # TODO: This will be 
when improved SPARK-9325 or SPARK-13436 are fixed + eval(parse(text = paste0("df$", colname))) +} else if (class(col) == "Column") { + # Append the given column to the dataset + df$x <- col + colname <- "x" + col +} + +stats <- collect(describe(df[, colname])) +min <- as.numeric(stats[4, 2]) +max <- as.numeric(stats[5, 2]) + +# Normalize the data +xnorm <- (x - min) / (max - min) + +# Round the data to 4 significant digits. This is to avoid rounding issues. +xnorm <- cast(xnorm * 1, "integer") / 1.0 + +# Since min = 0, max = 1 (data is already normalized) +normBinSize <- 1 / nbins +binsize <- (max - min) / nbins +approxBins <- xnorm / normBinSize + +# Adjust values that are equal to the upper bound of each bin +bins <- cast(approxBins - + ifelse(approxBins == cast(approxBins, "integer") & x != min, 1, 0), + "integer") + +df$bins <- bins --- End diff -- @felixcheung I need to remove NA values from Column `x` too, since `x` could be an arbitrary Column expression. Therefore, the `na.omit() `invocation should go afterwards --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-212141260 @shivaram @felixcheung I have addressed all your comments. Thank you!
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60287324 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,100 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column + df <- na.omit(df[, colname]) + + # TODO: This will be 
when improved SPARK-9325 or SPARK-13436 are fixed + eval(parse(text = paste0("df$", colname))) +} else if (class(col) == "Column") { + # Append the given column to the dataset + df$x <- col + colname <- "x" + col +} + +stats <- collect(describe(df[, colname])) +min <- as.numeric(stats[4, 2]) +max <- as.numeric(stats[5, 2]) + +# Normalize the data +xnorm <- (x - min) / (max - min) + +# Round the data to 4 significant digits. This is to avoid rounding issues. +xnorm <- cast(xnorm * 1, "integer") / 1.0 + +# Since min = 0, max = 1 (data is already normalized) +normBinSize <- 1 / nbins +binsize <- (max - min) / nbins +approxBins <- xnorm / normBinSize + +# Adjust values that are equal to the upper bound of each bin +bins <- cast(approxBins - + ifelse(approxBins == cast(approxBins, "integer") & x != min, 1, 0), + "integer") + +df$bins <- bins --- End diff -- @shivaram @felixcheung The original DataFrame is NOT being mutated. As a matter of fact, R doesn't support passing parameters by reference, so effectively, a new copy of the DataFrame is being created every time this function is being invoked. This should, in turn, trigger the creation of a new corresponding Java object. To illustrate this, notice the dataset doesn't change after running histogram(): ``` > str(irisDF) 'DataFrame': 5 variables: $ Sepal_Length: num 5.1 4.9 4.7 4.6 5 5.4 $ Sepal_Width : num 3.5 3 3.2 3.1 3.6 3.9 $ Petal_Length: num 1.4 1.4 1.3 1.5 1.4 1.7 $ Petal_Width : num 0.2 0.2 0.2 0.2 0.2 0.4 $ Species : chr "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" > histogram(irisDF, irisDF$Sepal_Lengt
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60284165 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,100 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column + df <- na.omit(df[, colname]) --- End diff -- Yes. 
Let me fix that.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r60278988 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,100 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). Default value is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @rdname histogram +#' @family agg_funcs +#' @export +#' @examples +#' \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, the histogram can be +#' # rendered using the ggplot2 library: +#' +#' require(ggplot2) +#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100) +#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency") +#' } +setMethod("histogram", + signature(df = "DataFrame", col = "characterOrColumn"), + function(df, col, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") +} + +# Round nbins to the smallest integer +nbins <- floor(nbins) + +# Validate col +if (is.null(col)) { + stop("col must be specified.") +} + +colname <- col +x <- if (class(col) == "character") { + if (!colname %in% names(df)) { +stop("Specified colname does not belong to the given DataFrame.") + } + + # Filter NA values in the target column + df <- na.omit(df[, colname]) + + # TODO: This will be 
when improved SPARK-9325 or SPARK-13436 are fixed + eval(parse(text = paste0("df$", colname))) +} else if (class(col) == "Column") { + # Append the given column to the dataset + df$x <- col --- End diff -- @shivaram That's because the user could do: `histogram(irisDF, irisDF$Sepal_Length + 1, nbins=12)` In that case, the given Column doesn't belong to the DataFrame. This gives the user a lot of flexibility and R-like feel. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-209030552 @felixcheung @shivaram This is done. Shall we merge?
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11336#issuecomment-209030229 @shivaram @falaki @felixcheung @rxin @sun-rui Any thoughts on this?
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-203678252 Looks good to you @felixcheung?
[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11318#issuecomment-203678123 @sun-rui @shivaram Shall we merge this?
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11336#issuecomment-203672948 Thanks @sun-rui @rxin @shivaram for your inputs. To alleviate the confusion about which columns can and cannot be collected, I propose the following (code already pushed): Currently there are 15 SparkR functions that return an "orphan" Column with no parent DataFrame: ``` rand, randn, unix_timestamp, struct, expr, column, lag, lead, lit, cume_dist, dense_rank, ntile, percent_rank, rank, row_number ``` The first three (i.e., rand, randn, and unix_timestamp) can be nicely collected as single elements. For example: ``` > rand() [1] 0.01483325 ``` The remaining ones don't make sense unless there's an associated DataFrame, so an empty vector will be returned: ``` > column("Species") Species > collect(column("Species")) character(0) ``` I think this makes sense: if you don't associate a Column with a DataFrame, there's nothing to collect. Now, for Columns that do belong to a DataFrame, collecting columns SIGNIFICANTLY improves usability in 138 functions/operators (besides addressing other issues in the design document), for example: > irisDF$Sepal_Length * 100 [1] 510 490 470 460 500 540 460 500 440 490 540 480 480 430 580 570 540 510 570 510 ... versus: > head(select(irisDF, irisDF$Sepal_Length * 100), 20)[, 1] [1] 510 490 470 460 500 540 460 500 440 490 540 480 480 430 580 570 540 510 570 510 @shivaram has a very valid point: this introduces discrepancies in the Spark APIs across multiple languages. I believe this is not necessarily bad, as R, especially, is a slightly different animal which already has a specific behavior for columns (i.e., vectors).
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-201597000 @felixcheung I just added support for Columns or characters. In my opinion, histogram() is a column function, and once we sort out #11336, it wouldn't have to take a DataFrame as a parameter.
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r57349827 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,81 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). The default is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, it would be very easy to +#' # render the histogram using R's visualization packages such as ggplot2. +#' +#' } +setMethod("histogram", + signature(df = "DataFrame"), + function(df, colname, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") --- End diff -- Done! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r57343915 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,81 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). The default is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, it would be very easy to +#' # render the histogram using R's visualization packages such as ggplot2. +#' +#' } +setMethod("histogram", + signature(df = "DataFrame"), + function(df, colname, nbins = 10) { --- End diff -- Yeah, but then what if the user wants to do: `hist(irisDF, irisDF$Sepal_Length + 1)` describe() would fail as this column doesn't belong to df. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r57342255 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,81 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) }) + +#' This function computes a histogram for a given SparkR Column. +#' +#' @name histogram +#' @title Histogram +#' @param nbins the number of bins (optional). The default is 10. +#' @param df the DataFrame containing the Column to build the histogram from. +#' @param colname the name of the column to build the histogram from. +#' @return a data.frame with the histogram statistics, i.e., counts and centroids. +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Compute histogram statistics +#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12) +#' +#' # Once SparkR has computed the histogram statistics, it would be very easy to +#' # render the histogram using R's visualization packages such as ggplot2. +#' +#' } +setMethod("histogram", + signature(df = "DataFrame"), + function(df, colname, nbins = 10) { +# Validate nbins +if (nbins < 2) { + stop("The number of bins must be a positive integer number greater than 1.") --- End diff -- R doesn't (see below). Let me change my code so it rounds nbins to the lowest integer. > str(hist(iris$Sepal.Length, breaks=10.5)) List of 6 $ breaks : num [1:9] 4 4.5 5 5.5 6 6.5 7 7.5 8 $ counts : int [1:8] 5 27 27 30 31 18 6 6 $ density : num [1:8] 0.0667 0.36 0.36 0.4 0.4133 ... $ mids: num [1:8] 4.25 4.75 5.25 5.75 6.25 6.75 7.25 7.75 $ xname : chr "iris$Sepal.Length" $ equidist: logi TRUE --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r57259854 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,81 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) })
+
+#' This function computes a histogram for a given SparkR Column.
+#'
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). The default is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @examples \dontrun{
+#'
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#'
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, it would be very easy to
+#' # render the histogram using R's visualization packages such as ggplot2.
+#' }
+setMethod("histogram",
+          signature(df = "DataFrame"),
+          function(df, colname, nbins = 10) {
--- End diff --
@felixcheung Yeah, I thought of that, but I don't know how to compute the min and the max in one single pass given a Column (not a name). I used describe(), which requires a column name. I also tried agg(), but it cannot compute more than one statistic per column:
```
> collect(agg(irisDF, Sepal_Length="max", Sepal_Width="min"))
  max(Sepal_Length) min(Sepal_Width)
1               7.9                2
> collect(agg(irisDF, Sepal_Length="max", Sepal_Length="min"))
  max(Sepal_Length)
1               7.9
```
Suggestions?
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
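One possible answer to the "Suggestions?" above: SparkR's agg() also accepts Column expressions, not only the name = "stat" shorthand, which allows several statistics over the same column in one pass. A minimal sketch, assuming irisDF from the example and SparkR's min()/max() column functions being in scope:

```r
# Both statistics in a single job: pass Column expressions to agg()
# instead of name = "stat" pairs, which allow only one statistic
# per column name.
stats <- collect(agg(irisDF,
                     min(irisDF$Sepal_Length),
                     max(irisDF$Sepal_Length)))
```

collect() then returns a one-row local data.frame holding both values, so the range needed for the bin width is available from a single scan.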
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11569#discussion_r57252393 --- Diff: R/pkg/R/functions.R --- @@ -2638,3 +2638,81 @@ setMethod("sort_array", jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc) column(jc) })
+
+#' This function computes a histogram for a given SparkR Column.
+#'
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). The default is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @examples \dontrun{
+#'
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#'
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, it would be very easy to
+#' # render the histogram using R's visualization packages such as ggplot2.
+#' }
+setMethod("histogram",
+          signature(df = "DataFrame"),
+          function(df, colname, nbins = 10) {
+            # Validate nbins
+            if (nbins < 2) {
+              stop("The number of bins must be a positive integer number greater than 1.")
--- End diff --
Yup, you could have a histogram with 2 bins:
```
> histogram(irisDF, "Sepal_Length", nbins=2)
  bins counts centroids
1    0     95       5.2
2    1     55       7.0
```
[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11569#issuecomment-199552264 @shivaram @sun-rui @felixcheung Yeah, that makes sense. I modified the histogram() function so that it now only computes the histogram statistics; there is neither rendering nor a dependency on ggplot2 anymore. I think the histogram stats are still very useful for an R user, and anyone who wants to plot them later is free to use any of R's packages.
[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11318#issuecomment-199464512 @sun-rui Done with the style issues. Thanks!
[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11318#discussion_r56418618 --- Diff: R/pkg/R/DataFrame.R --- @@ -1217,29 +1217,38 @@ setMethod("[[", signature(x = "DataFrame", i = "numericOrcharacter"), #' @rdname subset --- End diff -- Done!
[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11318#issuecomment-197566393 @felixcheung @shivaram @sun-rui I have addressed all your comments. Do we have a consensus on the default value for drop? I'd say drop=TRUE makes sense because that matches base R's behavior. In any case, it's just a one-line change for me. Please let me know. Thanks!
[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11220#issuecomment-195127575 Thanks, @shivaram!
[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11220#discussion_r55481841 --- Diff: R/pkg/R/DataFrame.R --- @@ -303,8 +303,28 @@ setMethod("colnames", #' @rdname columns #' @name colnames<- setMethod("colnames<-", - signature(x = "DataFrame", value = "character"), --- End diff -- Thanks @felixcheung for investigating this further!
[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11318#issuecomment-194128322 @felixcheung @sun-rui @shivaram Can you folks please take a look at this one? Thank you!
[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11220#discussion_r55418947 --- Diff: R/pkg/R/DataFrame.R --- @@ -303,8 +303,28 @@ setMethod("colnames", #' @rdname columns #' @name colnames<-
 setMethod("colnames<-",
-          signature(x = "DataFrame", value = "character"),
+          signature(x = "DataFrame"),
           function(x, value) {
+
+            # Check parameter integrity
+            if (class(value) != "character") {
+              stop("Invalid column names.")
+            }
+
+            if (length(value) != ncol(x)) {
+              stop("Column names must have the same length as the number of columns in the dataset.")
+            }
+
+            if (any(is.na(value))) {
+              stop("Column names cannot be NA.")
+            }
+
+            # Check if the column names have . in them
+            if (any(regexec(".", value, fixed=TRUE)[[1]][1] != -1)) {
--- End diff --
@sun-rui @felixcheung @shivaram Folks: this is a really simple thing. Shall we merge it?
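An aside on the quoted check: regexec() returns one match result per element of its text argument, so indexing `[[1]]` inspects only the first proposed column name, and a "." in any later name would slip through. A sketch of an equivalent check over the whole vector using base R's grepl(); the error message here is illustrative, not the PR's actual wording:

```r
# TRUE whenever any proposed column name contains a literal "."
# grepl() is vectorized, so every element of `value` is checked.
if (any(grepl(".", value, fixed = TRUE))) {
  stop("Column names cannot contain the '.' symbol.")
}
```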
[GitHub] spark pull request: Added histogram function
GitHub user olarayej opened a pull request: https://github.com/apache/spark/pull/11569 Added histogram function ## What changes were proposed in this pull request? Added method histogram() to compute the histogram of a Column. **Usage:**
```
# Create a DataFrame from the Iris dataset
irisDF <- createDataFrame(sqlContext, iris)

# Render a histogram for the Sepal_Length column
histogram(irisDF, "Sepal_Length", nbins=12)
```
Note: Usage will change once SPARK-9325 is figured out so that histogram() only takes a Column as a parameter, as opposed to a DataFrame and a name. ## How was this patch tested? All unit tests pass. I added specific unit cases for different scenarios. You can merge this pull request into a Git repository by running: $ git pull https://github.com/olarayej/spark SPARK-13734 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11569.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11569 commit efc2f6634b54cd91e4946d4d4e04be769769f4ad Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.usca.ibm.com> Date: 2016-03-08T00:19:07Z Added histogram function
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11336#issuecomment-192400350 @sun-rui Does that make sense to you? @shivaram @felixcheung Any comments?
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11336#issuecomment-191399711 Also, the fact that the size of a column depends on the join seems counter-intuitive for an R user:
```
> dim(irisDF2)
[1] 150 5
> dim(irisDF)
[1] 150 5
> x <- irisDF$Sepal_Length + irisDF2$Sepal_Length
```
In R, x will always have 150 elements. However:
```
# Cartesian product
> df3 <- join(irisDF, irisDF2)
> dim(select(df3, x))
[1] 22500 1

# Inner join by Species
> df4 <- merge(irisDF, irisDF2, by="Species")
> dim(select(df4, x))
[1] 7500 1
```
I still think SparkR shouldn't allow operations between columns coming from different DataFrames. And, in the case of a join, operations can be performed on the joined DataFrame (e.g., df3) as opposed to the original ones (e.g., irisDF and irisDF2).
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11336#issuecomment-191003387 @sun-rui Yes. In that case, c3 will only be associated with df3.
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11336#issuecomment-190867380 SparkR doesn't support operations between columns from different DataFrame objects. Yet you can do:
```
c1 <- df1$c1
c2 <- df2$c2
c3 <- c1 + c2
```
c3 can't be used at all. See examples below:
```
## Create two DataFrames from Iris
> irisDF <- createDataFrame(sqlContext, iris)
> irisDF2 <- createDataFrame(sqlContext, iris)

## Create Column x, adding two Columns in two DataFrame's
> x <- irisDF$Sepal_Length + irisDF2$Sepal_Length

## You can't use Column x as a predicate
> irisDF[x > 0, ]
16/03/01 11:04:19 ERROR RBackendHandler: filter on 76 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: resolved attribute(s) Sepal_Length#20 missing from Sepal_Length#15,Petal_Width#18,Sepal_Width#16,Petal_Length#17,Species#19 in operator !Filter ((Sepal_Length#15 + Sepal_Length#20) > 0.0);

## You can't select Column x either
> select(irisDF, x)
16/03/01 11:04:43 ERROR RBackendHandler: select on 76 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: resolved attribute(s) Sepal_Length#20 missing from Sepal_Length#15,Petal_Width#18,Sepal_Width#16,Petal_Length#17,Species#19 in operator !Project [(Sepal_Length#15 + Sepal_Length#20) AS (Sepal_Length + Sepal_Length)#25];

> select(irisDF2, x)
16/03/01 11:04:45 ERROR RBackendHandler: select on 91 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: resolved attribute(s) Sepal_Length#15 missing from Sepal_Length#20,Sepal_Width#21,Species#24,Petal_Width#23,Petal_Length#22 in operator !Project [(Sepal_Length#15 + Sepal_Length#20) AS (Sepal_Length + Sepal_Length)#26];
```
In my opinion, we should throw an error if the user is trying to operate on Columns coming from different DataFrames.
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11336#issuecomment-190334385 Can any of you folks please take a look at the code? Thanks!
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11336#issuecomment-189053702 @felixcheung It wasn't an R issue after all. The problem was that I hadn't been able to rebuild Spark in the last couple of days due to SPARK-13431, and I needed changes from SPARK-12799. Now that it's fixed, everything runs fine on R 3.2.2. Thank you!
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11336#issuecomment-188484496 Thanks, folks. Looks like all tests pass now! :-) However, in my environment (R 3.2.2), two tests don't pass. We should be careful whenever upgrading the R version:
```
1. Failure (at test_sparkSQL.R#1052): column functions - result not equal to expected
Names: 1 string mismatch
2. Failure (at test_sparkSQL.R#1058): column functions - result not equal to expected
Names: 1 string mismatch
Error: Test failures
```
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11336#issuecomment-188138539 @sun-rui @shivaram Do you know which version of R and of SparkR's dependencies are being used in Jenkins? Tests run fine in my environment (I have reviewed my code and run the unit tests many times). Wondering if that's due to a different version of R, testthat, devtools, rJava, etc.
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11336#issuecomment-188046181 @AmplabJenkins Jenkins, could you retest please? I see ERROR: Error fetching remote repo 'origin'
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11336#issuecomment-187996241 Jenkins, retest please. All tests pass for me after checking out this branch.
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11336#issuecomment-187996142 Thanks @falaki!
[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...
GitHub user olarayej opened a pull request: https://github.com/apache/spark/pull/11336 [SPARK-9325][SPARK-R] collect() head() and show() for Columns See attached design document [SparkR collect (JIRA doc).pdf](https://github.com/apache/spark/files/143656/SparkR.collect.JIRA.doc.pdf) You can merge this pull request into a Git repository by running: $ git pull https://github.com/olarayej/spark SPARK-9325 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11336.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11336 commit 6fc97e02975909cb72e27077aac97d4f90b332d5 Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.usca.ibm.com> Date: 2016-02-23T23:48:55Z Support for collect() on Columns commit fbf9b02b478b8eb4845232e09932d068cb393fd8 Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.usca.ibm.com> Date: 2016-02-24T00:00:52Z Removed drop=F from other PR
[GitHub] spark pull request: [SPARK-13436][SPARKR]
GitHub user olarayej opened a pull request: https://github.com/apache/spark/pull/11318 [SPARK-13436][SPARKR] ## What changes were proposed in this pull request? Added parameter drop to the subsetting operator [. Refer to R's documentation for the behavior of parameter drop. ## How was this patch tested? Ran all unit tests. Added drop=F to some of the tests where a DataFrame was required as opposed to a Column. You can merge this pull request into a Git repository by running: $ git pull https://github.com/olarayej/spark SPARK-13436 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11318.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11318 commit e8c64156e468105a4323f59d1ece87c8fb6662f4 Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.usca.ibm.com> Date: 2016-02-16T18:14:40Z Added parameter validations for colnames<- commit 07e541b7e55a322ea7c74e230ee897ebe9584197 Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.attlocal.net> Date: 2016-02-22T18:16:21Z Added one test for replacing . with _ in column names assignment commit 9fa2f5f13b0e6bb9a7ad9b53fc48c6694e49565a Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.attlocal.net> Date: 2016-02-23T02:54:50Z Added drop parameter to subsetting operator. Rewrote [ as one single method commit 632a81fe89b0f461b8084f6f2048046eb17c04b0 Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.attlocal.net> Date: 2016-02-23T02:58:29Z Merge branch 'master' of https://github.com/apache/spark into SPARK-13436
[GitHub] spark pull request: [SPARK-13436] [Mesos] Document that spark.meso...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11311#issuecomment-187425038 @mgummelt I'm guessing you got the wrong PR number? Could you please fix this?
[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11220#discussion_r53701739 --- Diff: R/pkg/R/DataFrame.R --- @@ -303,8 +303,28 @@ setMethod("colnames", #' @rdname columns #' @name colnames<- setMethod("colnames<-", - signature(x = "DataFrame", value = "character"), + signature(x = "DataFrame"), function(x, value) { + +# Check parameter integrity +if (class(value) != "character") { + stop("Invalid column names.") +} + +if (length(value) != ncol(x)) { + stop( +"Column names must have the same length as the number of columns in the dataset.") +} + +if (any(is.na(value))) { + stop("Column names cannot be NA.") +} + +# Check if the column names have . in it +if (any(regexec(".", value, fixed=TRUE)[[1]][1] != -1)) { --- End diff -- Done. @sun-rui, @felixcheung. Shall we merge this PR?
[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11220#issuecomment-187318737 Jenkins, retest please
[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11220#discussion_r53504008 --- Diff: R/pkg/R/DataFrame.R --- @@ -303,8 +303,28 @@ setMethod("colnames", #' @rdname columns #' @name colnames<- setMethod("colnames<-", - signature(x = "DataFrame", value = "character"), + signature(x = "DataFrame"), function(x, value) { + +# Check parameter integrity +if (class(value) != "character") { + stop("Invalid column names.") +} + +if (length(value) != ncol(x)) { + stop( +"Column names must have the same length as the number of columns in the dataset.") +} + +if (any(is.na(value))) { + stop("Column names cannot be NA.") +} + +# Check if the column names have . in it +if (any(regexec(".", value, fixed=TRUE)[[1]][1] != -1)) { --- End diff -- @felixcheung Not sure I follow your idea. Is this what you refer to?
```
# Note: if this test is broken, remove the check for the "." character in the colnames<- method
expect_equal(colnames(irisDF)[1], "Sepal_Length")
```
[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11220#discussion_r53375517 --- Diff: R/pkg/R/DataFrame.R --- @@ -303,8 +303,28 @@ setMethod("colnames", #' @rdname columns #' @name colnames<- setMethod("colnames<-", - signature(x = "DataFrame", value = "character"), --- End diff -- The reason I added this was because of this:
```
> colnames(iris) <- 1
Error in `colnames<-`(`*tmp*`, value = 1) : 'dimnames' applied to non-array
```
After I ran that, I saw this:
```
> showMethods("colnames<-")
Function: colnames<- (package SparkR)
x="ANY", value="ANY"
x="DataFrame", value="character"
x="DataFrame", value="numeric" (inherited from: x="ANY", value="ANY")
```
So it looks like R automatically adds definitions of colnames<- if value is other than character. This does not happen with coltypes<-, since it's not part of the base package and doesn't have an (ANY, ANY) signature:
```
> coltypes(irisDF) <- 1
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'coltypes<-' for signature '"DataFrame", "numeric"'
```
[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11220#discussion_r53373470 --- Diff: R/pkg/R/DataFrame.R --- @@ -303,8 +303,28 @@ setMethod("colnames", #' @rdname columns #' @name colnames<- setMethod("colnames<-", - signature(x = "DataFrame", value = "character"), + signature(x = "DataFrame"), function(x, value) { + +# Check parameter integrity +if (class(value) != "character") { + stop("Invalid column names.") +} + +if (length(value) != ncol(x)) { + stop( +"Column names must have the same length as the number of columns in the dataset.") +} + +if (any(is.na(value))) { + stop("Column names cannot be NA.") +} + +# Check if the column names have . in it +if (any(regexec(".", value, fixed=TRUE)[[1]][1] != -1)) { --- End diff -- @felixcheung @sun-rui Thanks for your input. Right now if I assign column names containing the "." character, any subsequent operation on the DataFrame will fail. Now, regarding @felixcheung's comment on the test case, right now there are two test cases with str() and with() expecting colnames of iris to be "Sepal_Length", ..., etc. Those will be broken when they fix SPARK-11976. No need to add more.
[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...
GitHub user olarayej opened a pull request: https://github.com/apache/spark/pull/11220 [SPARK-13327][SPARKR] Added parameter validations for colnames<- You can merge this pull request into a Git repository by running: $ git pull https://github.com/olarayej/spark SPARK-13312-3 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11220.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11220 commit e8c64156e468105a4323f59d1ece87c8fb6662f4 Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.usca.ibm.com> Date: 2016-02-16T18:14:40Z Added parameter validations for colnames<-
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/9613#issuecomment-172054570 Thanks, @shivaram!
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/9613#issuecomment-171441589 @felixcheung Done. Thanks!
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r49493628 --- Diff: R/pkg/R/generics.R --- @@ -581,6 +579,10 @@ setGeneric("unionAll", function(x, y) { standardGeneric("unionAll") }) #' @export setGeneric("where", function(x, condition) { standardGeneric("where") }) +#' @rdname with +#' @export +setGeneric("with") --- End diff -- Fixed this and also re-ordered the generics declarations for attach and as.data.frame.
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/9613#issuecomment-171013162 Jenkins, could you retest please? The error I see is "Error fetching remote repo 'origin'"
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/9613#issuecomment-171101731 @SparkQA Could you retest?
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/9613#issuecomment-170624621 Happy New Year, folks! Shall we merge this PR? @shivaram
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/9613#issuecomment-166544706 @shivaram I have addressed all your comments. Should we close this pull request?
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r48079567 --- Diff: R/pkg/R/generics.R --- @@ -509,13 +520,8 @@ setGeneric("saveAsTable", function(df, tableName, source, mode, ...) { standardGeneric("saveAsTable") }) -#' @rdname withColumn -#' @export -setGeneric("transform", function(`_data`, ...) {standardGeneric("transform") }) --- End diff -- This has been fixed! Thanks!
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/9613#issuecomment-165923895 @felixcheung Done!
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/9613#issuecomment-165607267 @shivaram I have removed the caching logic as you indicated. @felixcheung @sun-rui I have already explained why we can't use R's str() function under the covers. Any more comments? Otherwise, should we merge? Thank you!
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r47984121 --- Diff: R/pkg/R/DataFrame.R --- @@ -2151,3 +2151,97 @@ setMethod("coltypes", rTypes }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str +#' @family DataFrame functions +#' @param object a DataFrame +#' @examples \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' } +setMethod("str", + signature(object = "DataFrame"), + function(object) { + +# TODO: These could be made global parameters, though in R it's not the case +MAX_CHAR_PER_ROW <- 120 +MAX_COLS <- 100 + +# Get the column names and types of the DataFrame +names <- names(object) +types <- coltypes(object) + +# Get the number of rows. +# TODO: Ideally, this should be cached +cachedCount <- nrow(object) + +# Get the first elements of the dataset. Limit number of columns accordingly +localDF <- if (ncol(object) > MAX_COLS) { + head(object[, c(1:MAX_COLS)]) + } else { + head(object) + } + +# The number of observations will be displayed only if the number +# of rows of the dataset has already been cached. +if (!is.null(cachedCount)) { + cat(paste0("'", class(object), "': ", cachedCount, " obs. of ", +length(names), " variables:\n")) +} else { + cat(paste0("'", class(object), "': ", length(names), " variables:\n")) +} + +# Whether the ... should be printed at the end of each row +ellipsis <- FALSE + +# Add ellipsis (i.e., "...") if there are more rows than shown +if (!is.null(cachedCount) && (cachedCount > 6)) { + ellipsis <- TRUE +} + +if (nrow(localDF) > 0) { + for (i in 1 : ncol(localDF)) { +firstElements <- "" --- End diff -- I have fixed this. 
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r46995766 --- Diff: R/pkg/R/DataFrame.R --- @@ -2151,3 +2151,97 @@ setMethod("coltypes", rTypes }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str +#' @family DataFrame functions +#' @param object a DataFrame +#' @examples \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' } +setMethod("str", + signature(object = "DataFrame"), + function(object) { + +# TODO: These could be made global parameters, though in R it's not the case +MAX_CHAR_PER_ROW <- 120 +MAX_COLS <- 100 + +# Get the column names and types of the DataFrame +names <- names(object) +types <- coltypes(object) + +# Get the number of rows. +# TODO: Ideally, this should be cached +cachedCount <- nrow(object) + +# Get the first elements of the dataset. Limit number of columns accordingly +localDF <- if (ncol(object) > MAX_COLS) { + head(object[, c(1:MAX_COLS)]) + } else { + head(object) + } + +# The number of observations will be displayed only if the number +# of rows of the dataset has already been cached. +if (!is.null(cachedCount)) { + cat(paste0("'", class(object), "': ", cachedCount, " obs. of ", +length(names), " variables:\n")) +} else { + cat(paste0("'", class(object), "': ", length(names), " variables:\n")) +} + +# Whether the ... 
should be printed at the end of each row +ellipsis <- FALSE + +# Add ellipsis (i.e., "...") if there are more rows than shown +if (!is.null(cachedCount) && (cachedCount > 6)) { + ellipsis <- TRUE +} + +if (nrow(localDF) > 0) { + for (i in 1 : ncol(localDF)) { +firstElements <- "" + +# Get the first elements for each column +if (types[i] == "character") { + firstElements <- paste(paste0("\"", localDF[,i], "\""), collapse = " ") +} else { + firstElements <- paste(localDF[,i], collapse = " ") +} + +# Add the corresponding number of spaces for alignment +spaces <- paste(rep(" ", max(nchar(names) - nchar(names[i]))), collapse="") + +# Get the short type. For 'character', it would be 'chr'; +# 'for numeric', it's 'num', etc. --- End diff -- Combining those two lines would result in a 106-character line.
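The short-type abbreviations mentioned in the diff's comment can be sketched as a simple lookup table. The exact mapping used in the PR may differ; `SHORT_TYPES` and `shortType` here are hypothetical names for illustration:

```r
# Hypothetical lookup from R class names to the abbreviations
# printed by str(), e.g. 'character' -> 'chr', 'numeric' -> 'num'.
SHORT_TYPES <- c(character = "chr", numeric = "num",
                 integer = "int", logical = "logi")

shortType <- function(type) {
  s <- unname(SHORT_TYPES[type])
  # Fall back to the full type name (e.g. "map") when no abbreviation exists
  if (is.na(s)) type else s
}
```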
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/9613#issuecomment-162725183 @felixcheung, @sun-rui As I mentioned in my previous comment, it's not only a matter of replacing data.frame with DataFrame in the header. There are also issues with the number of rows and with the data types (the complex ones). For example: > x <- createDataFrame(sqlContext, list(list(as.environment(list("a"="b", "c"="d", "e"="f"))))) > str(x) 'DataFrame': 1 obs. of 1 variables: $ _1: map > str(as.data.frame(x)) 'data.frame': 1 obs. of 1 variable: $ _1:List of 1 ..$ :
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/9613#issuecomment-159366139 @shivaram Are any further comments, or clarifications of the existing ones, required from my end? Otherwise, should we merge this PR? Thank you!
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r45537232 --- Diff: R/pkg/R/DataFrame.R --- @@ -2199,3 +2199,97 @@ setMethod("coltypes", rTypes }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str +#' @family DataFrame functions +#' @param object a DataFrame +#' @examples \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' } +setMethod("str", + signature(object = "DataFrame"), + function(object) { + +# TODO: These could be made global parameters, though in R it's not the case +MAX_CHAR_PER_ROW <- 120 +MAX_COLS <- 100 + +# Get the column names and types of the DataFrame +names <- names(object) +types <- coltypes(object) + +# Get the number of rows. +# TODO: Ideally, this should be cached +cachedCount <- nrow(object) + +# Get the first elements of the dataset. Limit number of columns accordingly +dataFrame <- if (ncol(object) > MAX_COLS) { + head(object[, c(1:MAX_COLS)]) + } else { + head(object) + } + +# The number of observations will be displayed only if the number +# of rows of the dataset has already been cached. +if (!is.null(cachedCount)) { --- End diff -- Yes, that's why I added the TODO. In our implementation, we had a global cache to store the number of rows of the datasets in the current session. At some point, we'll need to implement some caching mechanism so that every time you run str() or nrow(), you don't have to do a full data scan. 
When such a caching mechanism is implemented, all we'll need to do is replace this line: cachedCount <- nrow(object) with: cachedCount <- FUNCTION_TO_GET_CACHED_NROW(object) The behavior of str() is such that if nrow() hasn't been cached, the number of rows is simply not shown.
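The session-level cache described above could be sketched as an environment keyed by some DataFrame identifier. All names here are hypothetical — this is not SparkR code, only an illustration of the idea that str() skips the observation count when no cached value exists:

```r
# Hypothetical session cache for row counts; not part of SparkR.
.rowCountCache <- new.env(parent = emptyenv())

cacheRowCount <- function(key, n) {
  assign(key, n, envir = .rowCountCache)
}

# Returns NULL when the count has not been computed yet, so a caller
# like str() can simply omit the "N obs." part of its header.
getCachedRowCount <- function(key) {
  if (exists(key, envir = .rowCountCache, inherits = FALSE)) {
    get(key, envir = .rowCountCache)
  } else {
    NULL
  }
}
```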
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r45537282 --- Diff: R/pkg/R/DataFrame.R --- @@ -2199,3 +2199,97 @@ setMethod("coltypes", rTypes }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str +#' @family DataFrame functions +#' @param object a DataFrame +#' @examples \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' } +setMethod("str", + signature(object = "DataFrame"), + function(object) { + +# TODO: These could be made global parameters, though in R it's not the case +MAX_CHAR_PER_ROW <- 120 +MAX_COLS <- 100 + +# Get the column names and types of the DataFrame +names <- names(object) +types <- coltypes(object) + +# Get the number of rows. +# TODO: Ideally, this should be cached +cachedCount <- nrow(object) + +# Get the first elements of the dataset. Limit number of columns accordingly +dataFrame <- if (ncol(object) > MAX_COLS) { --- End diff -- Good idea. Let me change that.
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r45537532 --- Diff: R/pkg/R/DataFrame.R --- @@ -2199,3 +2199,97 @@ setMethod("coltypes", rTypes }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str +#' @family DataFrame functions +#' @param object a DataFrame +#' @examples \dontrun{ +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' } +setMethod("str", + signature(object = "DataFrame"), + function(object) { + +# TODO: These could be made global parameters, though in R it's not the case +MAX_CHAR_PER_ROW <- 120 +MAX_COLS <- 100 + +# Get the column names and types of the DataFrame +names <- names(object) +types <- coltypes(object) + +# Get the number of rows. +# TODO: Ideally, this should be cached +cachedCount <- nrow(object) + +# Get the first elements of the dataset. Limit number of columns accordingly +dataFrame <- if (ncol(object) > MAX_COLS) { + head(object[, c(1:MAX_COLS)]) + } else { + head(object) + } + +# The number of observations will be displayed only if the number +# of rows of the dataset has already been cached. +if (!is.null(cachedCount)) { + cat(paste0("'", class(object), "': ", cachedCount, " obs. of ", +length(names), " variables:\n")) +} else { + cat(paste0("'", class(object), "': ", length(names), " variables:\n")) +} + +# Whether the ... should be printed at the end of each row +ellipsis <- FALSE + +# Add ellipsis (i.e., "...") if there are more rows than shown +if (!is.null(cachedCount) && (cachedCount > 6)) { + ellipsis <- TRUE +} + +if (nrow(dataFrame) > 0) { --- End diff -- Good point :-). 
I thought about that before, but I realized there are three issues: 1) The header is different (DataFrame vs data.frame). 2) The number of rows would not match, and in some cases we don't want to show it. 3) We're still not clear on the mapping between the column types of DataFrame and data.frame. I added a comment on JIRA SPARK-10863 (see link below). If we implemented corresponding data types in R, we could leverage part of utils:::str() in SparkR:::str(). https://issues.apache.org/jira/browse/SPARK-10863
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/9613#issuecomment-158241832 Jenkins, could you retest please?
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r45413626 --- Diff: R/pkg/R/generics.R --- @@ -561,6 +579,10 @@ setGeneric("unionAll", function(x, y) { standardGeneric("unionAll") }) #' @export setGeneric("where", function(x, condition) { standardGeneric("where") }) +#' @rdname with --- End diff -- Nice catch. I have fixed it.
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/9613#issuecomment-157786743 @felixcheung @shivaram Any more comments, folks? Otherwise, can we merge this? Thank you!
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/9613#issuecomment-157903598 @shivaram I have updated this branch with master. Thank you!