[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/11569


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214907846
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214907849
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57043/
Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214907828
  
LGTM. Merging this.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214907761
  
**[Test build #57043 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57043/consoleFull)** for PR 11569 at commit [`838c915`](https://github.com/apache/spark/commit/838c9155839bbb7fd4d5f855a9d88ae68fef2ffb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214904802
  
**[Test build #57043 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57043/consoleFull)** for PR 11569 at commit [`838c915`](https://github.com/apache/spark/commit/838c9155839bbb7fd4d5f855a9d88ae68fef2ffb).





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214903911
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214903894
  
**[Test build #57038 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57038/consoleFull)** for PR 11569 at commit [`fc2f6a3`](https://github.com/apache/spark/commit/fc2f6a31166ac895b5c2ce05074f5c7edf372706).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214903913
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57038/
Test FAILed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214900440
  
**[Test build #57038 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57038/consoleFull)** for PR 11569 at commit [`fc2f6a3`](https://github.com/apache/spark/commit/fc2f6a31166ac895b5c2ce05074f5c7edf372706).





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214898708
  
LGTM. Thanks @olarayej - I just had a couple of minor comments about using `SparkDataFrame`. Other than that, this looks good to merge.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r61170989
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2469,6 +2469,126 @@ setMethod("drop",
             base::drop(x)
           })
 
+#' This function computes a histogram for a given SparkR Column.
+#'
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family SparkDataFrame functions
+#' @export
+#' @examples
+#' \dontrun{
+#'
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#'
+#' # Compute histogram statistics
+#' histStats <- histogram(irisDF, irisDF$Sepal_Length, nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts)) +
+#'         geom_bar(stat = "identity") +
+#'         xlab("Sepal_Length") + ylab("Frequency")
+#' }
+setMethod("histogram",
+          signature(df = "SparkDataFrame", col = "characterOrColumn"),
+          function(df, col, nbins = 10) {
+            # Validate nbins
+            if (nbins < 2) {
+              stop("The number of bins must be a positive integer number greater than 1.")
+            }
+
+            # Round nbins to the smallest integer
+            nbins <- floor(nbins)
+
+            # Validate col
+            if (is.null(col)) {
+              stop("col must be specified.")
+            }
+
+            colname <- col
+            x <- if (class(col) == "character") {
+              if (!colname %in% names(df)) {
+                stop("Specified colname does not belong to the given DataFrame.")
--- End diff --

`DataFrame` -> `SparkDataFrame` ?





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r61170976
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2469,6 +2469,126 @@ setMethod("drop",
             base::drop(x)
           })
 
+#' This function computes a histogram for a given SparkR Column.
+#'
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family SparkDataFrame functions
+#' @export
+#' @examples
+#' \dontrun{
+#'
+#' # Create a DataFrame from the Iris dataset
--- End diff --

As @felixcheung mentioned before: `DataFrame` -> `SparkDataFrame`? Or we can just delete this comment.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214898243
  
@shivaram @felixcheung I have addressed all your comments. Anything else? 
Or shall we merge? Thanks!





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214894940
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214894944
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57029/
Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214894669
  
**[Test build #57029 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57029/consoleFull)** for PR 11569 at commit [`e9dbc5b`](https://github.com/apache/spark/commit/e9dbc5b27c258777a539723e0ad4676db928736b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r61168496
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2465,6 +2465,110 @@ setMethod("drop",
             base::drop(x)
           })
 
+#' This function computes a histogram for a given SparkR Column.
+#'
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family DataFrame functions
+#' @export
+#' @examples
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#'
+#' # Compute histogram statistics
+#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
+#' }
+setMethod("histogram",
+          signature(df = "DataFrame", col = "characterOrColumn"),
+          function(df, col, nbins = 10) {
+            # Validate nbins
+            if (nbins < 2) {
+              stop("The number of bins must be a positive integer number greater than 1.")
+            }
+
+            # Round nbins to the smallest integer
+            nbins <- floor(nbins)
+
+            # Validate col
+            if (is.null(col)) {
+              stop("col must be specified.")
+            }
+
+            colname <- col
+            x <- if (class(col) == "character") {
+              if (!colname %in% names(df)) {
+                stop("Specified colname does not belong to the given DataFrame.")
+              }
+
+              # Filter NA values in the target column and remove all other columns
+              df <- na.omit(df[, colname])
+              getColumn(df, colname)
+
+            } else if (class(col) == "Column") {
+
+              # Append the given column to the dataset. This is to support Columns that
+              # don't belong to the DataFrame but are rather expressions
+              df$x <- col

@shivaram Yes, I have fixed this. Thanks!
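
For readers following the review, the quoted diff stops before the binning step, so the sketch below only illustrates what "counts and centroids" could look like for an ordinary local numeric vector in base R. The helper name `local_histogram_stats`, the equal-width binning, and the returned column names are assumptions made for illustration; this is not the SparkR implementation under review.

```r
# Hypothetical helper (not part of SparkR): equal-width histogram statistics
# for a local numeric vector. The validation mirrors the quoted diff; the
# binning logic is an assumption, since the diff is cut off before that step.
local_histogram_stats <- function(x, nbins = 10) {
  # Same check and message as in the quoted diff
  if (nbins < 2) {
    stop("The number of bins must be a positive integer number greater than 1.")
  }
  nbins <- floor(nbins)

  # Drop missing values, analogous to the na.omit() call in the diff
  x <- x[!is.na(x)]

  lo <- min(x)
  hi <- max(x)
  width <- (hi - lo) / nbins
  if (width == 0) {
    stop("All values are identical; cannot build equal-width bins.")
  }

  # Zero-based bin index; the maximum value is folded into the last bin
  idx <- pmin(floor((x - lo) / width), nbins - 1)

  data.frame(
    bins = seq_len(nbins) - 1,
    counts = tabulate(idx + 1, nbins = nbins),
    centroids = lo + width * (seq_len(nbins) - 0.5)
  )
}

# Usage with the local iris data set (note the local column name Sepal.Length)
stats <- local_histogram_stats(iris$Sepal.Length, nbins = 12)
print(stats)
```

A data.frame shaped like this can be passed to `ggplot(..., aes(x = centroids, y = counts))` in the same way the roxygen example in the diff does.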





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214890845
  
**[Test build #57029 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57029/consoleFull)** for PR 11569 at commit [`e9dbc5b`](https://github.com/apache/spark/commit/e9dbc5b27c258777a539723e0ad4676db928736b).





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214890244
  
@shivaram @felixcheung It looks like the version of lint-r running on the build server is different from the one on Spark's GitHub. Even though lint-r passes locally, I keep getting these errors:

R/DataFrame.R:2542:40: style: Put spaces around all infix operators.
   collapse=""
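
As an illustration of the style rule this message refers to: the linter appears to treat the `=` of a named argument as an infix operator that needs a space on each side. The surrounding `paste()` call and the `parts` vector below are assumptions, since the thread quotes only the `collapse=""` fragment.

```r
# Illustrative input; the real code at DataFrame.R:2542 is not shown in the thread
parts <- c("a", "b", "c")

# Form reported by the style check: no spaces around `=`
# joined <- paste(parts, collapse="")

# Form that satisfies "Put spaces around all infix operators"
joined <- paste(parts, collapse = "")
print(joined)
```

Both forms behave identically at run time; only the spacing differs.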






[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214888704
  
**[Test build #57028 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57028/consoleFull)** for PR 11569 at commit [`cd7ba4c`](https://github.com/apache/spark/commit/cd7ba4c3af26beba4ac4c0f09ea6f3560069d5a4).
 * This patch **fails R style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214888718
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57028/
Test FAILed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214888714
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214887742
  
**[Test build #57028 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57028/consoleFull)** for PR 11569 at commit [`cd7ba4c`](https://github.com/apache/spark/commit/cd7ba4c3af26beba4ac4c0f09ea6f3560069d5a4).





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-25 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r61009021
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2469,6 +2469,110 @@ setMethod("drop",
             base::drop(x)
           })
 
+#' This function computes a histogram for a given SparkR Column.
+#'
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family DataFrame functions
+#' @export
+#' @examples
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#'
+#' # Compute histogram statistics
+#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
+#' }
+setMethod("histogram",
+          signature(df = "DataFrame", col = "characterOrColumn"),
--- End diff --

Yeah, let me deliver that with Shivaram's fix as well. Thanks!





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-25 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r61008558
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2469,6 +2469,110 @@ setMethod("drop",
             base::drop(x)
           })
 
+#' This function computes a histogram for a given SparkR Column.
+#'
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family DataFrame functions
+#' @export
+#' @examples
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
--- End diff --

DataFrame -> SparkDataFrame, or just omit this comment line...





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-25 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r61008598
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2469,6 +2469,110 @@ setMethod("drop",
             base::drop(x)
           })
 
+#' This function computes a histogram for a given SparkR Column.
+#'
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family DataFrame functions
+#' @export
+#' @examples
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#'
+#' # Compute histogram statistics
+#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
+#' }
+setMethod("histogram",
+          signature(df = "DataFrame", col = "characterOrColumn"),
--- End diff --

`"DataFrame"` -> `"SparkDataFrame"`





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-25 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r61008503
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2469,6 +2469,110 @@ setMethod("drop",
             base::drop(x)
           })
 
+#' This function computes a histogram for a given SparkR Column.
+#'
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family DataFrame functions
--- End diff --

this has changed as well
`@family SparkDataFrame functions`
sorry this is such a moving target





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214539753
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56927/
Test FAILed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214539717
  
**[Test build #56927 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56927/consoleFull)** for PR 11569 at commit [`976e412`](https://github.com/apache/spark/commit/976e412e7cdcbee95164f05eaf088e5ec7b08160).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214539752
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214537526
  
**[Test build #56927 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56927/consoleFull)** for PR 11569 at commit [`976e412`](https://github.com/apache/spark/commit/976e412e7cdcbee95164f05eaf088e5ec7b08160).





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214528787
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214528791
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56924/
Test FAILed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214528756
  
**[Test build #56924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56924/consoleFull)** for PR 11569 at commit [`7cdb9e8`](https://github.com/apache/spark/commit/7cdb9e83ee4a7223347a7d10eac9ab2a3ce51ca7).





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214528780
  
**[Test build #56924 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56924/consoleFull)** for PR 11569 at commit [`7cdb9e8`](https://github.com/apache/spark/commit/7cdb9e83ee4a7223347a7d10eac9ab2a3ce51ca7).
 * This patch **fails some tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-23 Thread felixcheung
Github user felixcheung commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213854068
  
Please rebase to pick up the `DataFrame` -> `SparkDataFrame` class name change.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-22 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60797789
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2465,6 +2465,110 @@ setMethod("drop",
 base::drop(x)
   })
 
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family DataFrame functions
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histStats <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given DataFrame.")
+  }
+
+  # Filter NA values in the target column and remove all other columns
+  df <- na.omit(df[, colname])
+  getColumn(df, colname)
+
+} else if (class(col) == "Column") {
+
+  # Append the given column to the dataset. This is to support Columns that
+  # don't belong to the DataFrame but are rather expressions
+  df$x <- col
--- End diff --

Do we need to check whether `x` is already a column name in the data
frame? For example, I ran the code
```
irisDF$x <- irisDF$Petal_Width + 2.0
histogram(irisDF, irisDF$x, 8)
```
and I got an error
```
org.apache.spark.sql.AnalysisException: resolved attribute(s) x#141 missing 
from Species#4,Sepal_Length#0,x#269,Petal_Width#3,Petal_Length#2,Sepal_Width#1 
in operator !Project 
[Sepal_Length#0,Sepal_Width#1,Petal_Length#2,Petal_Width#3,Species#4,x#269,castcast(castx#141
 - 2.1) / 2.4) * 1.0) as int) as double) / 1.0) / 0.125) - CASE WHEN 
cast(castx#141 - 2.1) / 2.4) * 1.0) as int) as double) / 1.0) / 
0.125) = cast(cast(((cast(castx#141 - 2.1) / 2.4) * 1.0) as int) as 
double) / 1.0) / 0.125) as int) as double)) && NOT (x#141 = 2.1)) THEN 1.0 
ELSE 0.0 END) as int) AS bins#325]
```
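
One way to sidestep this kind of collision would be to pick a temporary column name that is guaranteed not to exist in the data frame yet. A minimal base-R sketch, assuming a hypothetical helper `unique_colname` (it is not part of SparkR or of this patch) and a plain character vector standing in for `names(df)`:

```r
# Hypothetical helper, not part of SparkR: return `base` if it is unused,
# otherwise append a numeric suffix until the name is free.
unique_colname <- function(existing, base = "x") {
  candidate <- base
  i <- 0
  while (candidate %in% existing) {
    i <- i + 1
    candidate <- paste0(base, "_", i)
  }
  candidate
}

# Stand-in for names(irisDF) after `irisDF$x <- ...` has been run
cols <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width",
          "Species", "x")
tmp <- unique_colname(cols)
print(tmp)
```

With a name chosen this way, appending the expression column would not overwrite a user column that happens to be called `x`.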





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-22 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213528826
  
@felixcheung @shivaram I'm done with all your suggestions. Thanks. Shall we 
merge?





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213528609
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213528509
  
**[Test build #56711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56711/consoleFull)** for PR 11569 at commit [`fc4c536`](https://github.com/apache/spark/commit/fc4c536ca55fe4beefc27139dad03093cff7194e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213528612
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56711/
Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213523010
  
**[Test build #56711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56711/consoleFull)** for PR 11569 at commit [`fc4c536`](https://github.com/apache/spark/commit/fc4c536ca55fe4beefc27139dad03093cff7194e).





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread felixcheung
Github user felixcheung commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213188379
  
looks good except 1 minor doc comment





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60675031
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2465,6 +2465,110 @@ setMethod("drop",
 base::drop(x)
   })
 
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family agg_funcs
--- End diff --

`#' @family DataFrame functions`





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213124256
  
@felixcheung @shivaram I have addressed all your comments. Thanks!





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213117725
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213117603
  
**[Test build #56589 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56589/consoleFull)** for PR 11569 at commit [`3e19fe8`](https://github.com/apache/spark/commit/3e19fe889c2c9709783503cd1082f4ab7ba6d37c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213117728
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56589/
Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213113387
  
**[Test build #56589 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56589/consoleFull)** for PR 11569 at commit [`3e19fe8`](https://github.com/apache/spark/commit/3e19fe889c2c9709783503cd1082f4ab7ba6d37c).





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213113112
  
**[Test build #56588 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56588/consoleFull)** for PR 11569 at commit [`c2c4601`](https://github.com/apache/spark/commit/c2c4601b1a09e23e5b1be64fb027b92f9638da20).
 * This patch **fails R style tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213113120
  
Build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213113123
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56588/
Test FAILed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213111260
  
**[Test build #56588 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56588/consoleFull)** for PR 11569 at commit [`c2c4601`](https://github.com/apache/spark/commit/c2c4601b1a09e23e5b1be64fb027b92f9638da20).





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60651721
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,107 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histStats <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given DataFrame.")
+  }
+
+  # Filter NA values in the target column and remove all other columns
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be improved when SPARK-9325 or SPARK-13436 are fixed
+  getColumn(df, colname)
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset. This is to support Columns that
+  # don't belong to the DataFrame but are rather expressions
+  df$x <- col
+
+  # Filter NA values in the target column. Cannot remove all other columns
+  # since given Column may be an expression on one or more existing columns
+  df <- na.omit(df)
+
+  colname <- "x"
+  col
+}
+
+# At this point, df only has one column: the one to compute the histogram from
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid rounding issues.
+xnorm <- cast(xnorm * 10000, "integer") / 10000.0
--- End diff --

Yeah, let me move it to DataFrame.R





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60651656
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,107 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", "sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histStats <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given DataFrame.")
+  }
+
+  # Filter NA values in the target column and remove all other columns
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be improved when SPARK-9325 or SPARK-13436 are fixed
+  getColumn(df, colname)
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset. This is to support Columns that
+  # don't belong to the DataFrame but are rather expressions
+  df$x <- col
+
+  # Filter NA values in the target column. Cannot remove all other columns
+  # since given Column may be an expression on one or more existing columns
+  df <- na.omit(df)
+
+  colname <- "x"
+  col
+}
+
+# At this point, df only has one column: the one to compute the histogram from
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid rounding issues.
+xnorm <- cast(xnorm * 10000, "integer") / 10000.0
--- End diff --

@felixcheung That would never be the case since I'm normalizing the data to 
be within [0, 1]. Line 2715:

`  xnorm <- (x - min) / (max - min)`
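
To illustrate the point on a plain numeric vector rather than a SparkR Column, here is a rough base-R sketch of the normalize-then-bin idea. The sample values, the variable names, and the scale factor of 10000 (inferred from the "4 significant digits" comment) are assumptions for illustration only; this is not the patch's code.

```r
# Illustrative data; in the patch these values live in a SparkR Column.
x <- c(4.3, 5.0, 5.8, 6.4, 7.9)
nbins <- 4

# Normalize to [0, 1]: the minimum maps to 0 and the maximum to 1.
xnorm <- (x - min(x)) / (max(x) - min(x))

# Because xnorm never exceeds 1, scaling by 10000 (assumed factor) before
# an integer cast stays far below the integer limit.
scaled <- as.integer(xnorm * 10000)

# Map each value to a bin index in 0..(nbins - 1); the maximum value is
# folded into the last bin instead of opening an extra one.
bins <- pmin(floor(xnorm * nbins), nbins - 1)

# Count values per bin and compute each bin's midpoint on the original scale.
counts <- tabulate(bins + 1, nbins = nbins)
width <- (max(x) - min(x)) / nbins
centroids <- min(x) + (seq_len(nbins) - 0.5) * width

print(data.frame(centroids = centroids, counts = counts))
```

The scaled values are bounded by the factor itself regardless of the magnitude of the original data, which is why the integer cast cannot overflow here.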





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-20 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60452619
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,100 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can 
be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", 
binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be when improved SPARK-9325 or SPARK-13436 
are fixed
+  eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset
+  df$x <- col
+  colname <- "x"
+  col
+}
+
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid 
rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
+
+# Since min = 0, max = 1 (data is already normalized)
+normBinSize <- 1 / nbins
+binsize <- (max - min) / nbins
+approxBins <- xnorm / normBinSize
+
+# Adjust values that are equal to the upper bound of each bin
+bins <- cast(approxBins -
+ ifelse(approxBins == cast(approxBins, "integer") 
& x != min, 1, 0),
+ "integer")
+
+df$bins <- bins
--- End diff --

@felixcheung I need to remove NA values from Column `x` too, since `x` 
could be an arbitrary Column expression. Therefore, the `na.omit()` invocation 
has to come after `df$x <- col`.
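
The ordering argument can be sketched in base R, assuming a local data.frame in place of a SparkR DataFrame. The helper `derived` is hypothetical and stands in for an arbitrary Column expression that can yield NA even when its inputs contain none:

```r
# Base-R analogy: a derived column can contain NA even when its inputs do not.
df <- data.frame(a = c(1, 4, 0, 9))

# Hypothetical expression standing in for an arbitrary Column expression:
# it yields NA wherever a is not positive.
derived <- function(d) ifelse(d$a > 0, log(d$a), NA)

# Order A: filter first, then append. The NA in the derived column survives.
filtered_first <- na.omit(df)
filtered_first$x <- derived(filtered_first)

# Order B: append first, then filter. The row with the NA is dropped.
appended_first <- df
appended_first$x <- derived(appended_first)
appended_first <- na.omit(appended_first)

print(nrow(filtered_first))
print(nrow(appended_first))
```

Only order B guarantees that the column used for the histogram is free of NA values.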





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60319306
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,107 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can 
be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", 
binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column and remove all other 
columns
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be when improved SPARK-9325 or SPARK-13436 
are fixed
+  getColumn(df, colname)
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset. This is to support 
Columns that
+  # don't belong to the DataFrame but are rather expressions
+  df$x <- col
+
+  # Filter NA values in the target column. Cannot remove all 
other columns
+  # since given Column may be an expression on one or more 
existing columns
+  df <- na.omit(df)
+
+  colname <- "x"
+  col
+}
+
+# At this point, df only has one column: the one to compute 
the histogram from
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid 
rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
--- End diff --

would this truncate the value if xnorm was close to 2*10^9 before * 1?





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60318933
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,100 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can 
be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", 
binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be when improved SPARK-9325 or SPARK-13436 
are fixed
+  eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset
+  df$x <- col
+  colname <- "x"
+  col
+}
+
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid 
rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
+
+# Since min = 0, max = 1 (data is already normalized)
+normBinSize <- 1 / nbins
+binsize <- (max - min) / nbins
+approxBins <- xnorm / normBinSize
+
+# Adjust values that are equal to the upper bound of each bin
+bins <- cast(approxBins -
+ ifelse(approxBins == cast(approxBins, "integer") 
& x != min, 1, 0),
+ "integer")
+
+df$bins <- bins
--- End diff --

perhaps swap these two lines?
```
+  df$x <- col
+  df <- na.omit(df)
```
to
```
+  df <- na.omit(df)
+  df$x <- col
```
that would make it more clear?





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread felixcheung
Github user felixcheung commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212145097
  
Thanks. As I commented 
[earlier](https://github.com/apache/spark/pull/11569#issuecomment-200947392), I 
suspect this belongs in DataFrame.R instead, since it is currently a 
function on DataFrame. I think that would be more discoverable and maintainable. 
When #11336 is resolved we could merge/update this to work by Column only.






[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212141260
  
@shivaram @felixcheung I have addressed all your comments. Thank you!





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212139757
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56269/
Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212139755
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212139610
  
**[Test build #56269 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56269/consoleFull)**
 for PR 11569 at commit 
[`b03c335`](https://github.com/apache/spark/commit/b03c335a6dc9818f54fc2633fb149f9f3ad0277d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212135413
  
**[Test build #56269 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56269/consoleFull)**
 for PR 11569 at commit 
[`b03c335`](https://github.com/apache/spark/commit/b03c335a6dc9818f54fc2633fb149f9f3ad0277d).





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212133278
  
**[Test build #56267 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56267/consoleFull)**
 for PR 11569 at commit 
[`046b7da`](https://github.com/apache/spark/commit/046b7dad841bbf13d1d4b93bf001474f74b25865).
 * This patch **fails R style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212133286
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212133290
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56267/
Test FAILed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212132210
  
**[Test build #56267 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56267/consoleFull)**
 for PR 11569 at commit 
[`046b7da`](https://github.com/apache/spark/commit/046b7dad841bbf13d1d4b93bf001474f74b25865).





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60287324
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,100 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can 
be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", 
binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be when improved SPARK-9325 or SPARK-13436 
are fixed
+  eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset
+  df$x <- col
+  colname <- "x"
+  col
+}
+
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid 
rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
+
+# Since min = 0, max = 1 (data is already normalized)
+normBinSize <- 1 / nbins
+binsize <- (max - min) / nbins
+approxBins <- xnorm / normBinSize
+
+# Adjust values that are equal to the upper bound of each bin
+bins <- cast(approxBins -
+ ifelse(approxBins == cast(approxBins, "integer") 
& x != min, 1, 0),
+ "integer")
+
+df$bins <- bins
--- End diff --

@shivaram @felixcheung The original DataFrame is NOT being mutated. As a 
matter of fact, R doesn't support passing parameters by reference, so 
effectively, a new copy of the DataFrame is being created every time this 
function is being invoked. This should, in turn, trigger the creation of a new 
corresponding Java object.
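
A minimal base-R sketch of the copy-on-modify behavior described above, assuming an ordinary data.frame in place of a SparkR DataFrame; that the SparkR object behaves the same way inside `histogram()` is the claim being made here, not something this sketch verifies. The function name `add_column` is hypothetical. Note that environments and reference classes are the usual exceptions to this rule in R.

```r
# Modifying an argument inside a function changes only the function's local copy.
add_column <- function(d) {
  d$bins <- seq_len(nrow(d))   # rebinds the local name `d` to a modified copy
  names(d)
}

original <- data.frame(x = c(5.1, 4.9, 4.7))
inside <- add_column(original)

print(inside)            # column names seen inside the function
print(names(original))   # the caller's object is unchanged
```
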

To illustrate this, notice the dataset doesn't change after running 
histogram():
```
> str(irisDF)
'DataFrame': 5 variables:
 $ Sepal_Length: num 5.1 4.9 4.7 4.6 5 5.4
 $ Sepal_Width : num 3.5 3 3.2 3.1 3.6 3.9
 $ Petal_Length: num 1.4 1.4 1.3 1.5 1.4 1.7
 $ Petal_Width : num 0.2 0.2 0.2 0.2 0.2 0.4
 $ Species : chr "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"

> histogram(irisDF, irisDF$Sepal_Length + 1)
   bins counts centroids
1     0      9      5.48
2     1     23      5.84
3     2     14      6.20
4     3     27      6.56
5     4     22      6.92
6     5     20      7.28
7     6     18      7.64
8     7      6      8.00
9     8      5      8.36
10    9      6      8.72

> str(irisDF)

[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212057016
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212057021
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56245/
Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212056674
  
**[Test build #56245 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56245/consoleFull)**
 for PR 11569 at commit 
[`adc3446`](https://github.com/apache/spark/commit/adc34461a869d4b4c072952b999c896047e994d6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60284165
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,100 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can 
be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", 
binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column
+  df <- na.omit(df[, colname])
--- End diff --

Yes. Let me fix that.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212051119
  
**[Test build #56245 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56245/consoleFull)**
 for PR 11569 at commit 
[`adc3446`](https://github.com/apache/spark/commit/adc34461a869d4b4c072952b999c896047e994d6).





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60281798
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,100 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can 
be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", 
binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be when improved SPARK-9325 or SPARK-13436 
are fixed
+  eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset
+  df$x <- col
+  colname <- "x"
+  col
+}
+
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid 
rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
+
+# Since min = 0, max = 1 (data is already normalized)
+normBinSize <- 1 / nbins
+binsize <- (max - min) / nbins
+approxBins <- xnorm / normBinSize
+
+# Adjust values that are equal to the upper bound of each bin
+bins <- cast(approxBins -
+ ifelse(approxBins == cast(approxBins, "integer") 
& x != min, 1, 0),
+ "integer")
+
+df$bins <- bins
--- End diff --

Perhaps create a new intermediate DataFrame instead of mutating the input 
DataFrame? The return value of this function is a local/native R data.frame 
anyway.
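
One way to read this suggestion, sketched in base R on a local data.frame rather than on a SparkR DataFrame: derived values such as the bin index go into a separate intermediate object, and only the summary is returned. `local_histogram` and its variable names are hypothetical; the arithmetic only mirrors the binning in the diff above and is not the PR's implementation. Base R copies on modification, so the caller's data.frame would be safe here in any case; the point of the sketch is the structure.

```r
# Hypothetical local analogue of histogram(); base R only, no SparkR.
local_histogram <- function(df, colname, nbins = 10) {
  stopifnot(is.data.frame(df), colname %in% names(df), nbins >= 2)
  nbins <- floor(nbins)

  x <- df[[colname]]
  x <- x[!is.na(x)]                 # drop NA values
  lo <- min(x)
  hi <- max(x)
  stopifnot(hi > lo)                # a constant column cannot be binned
  binsize <- (hi - lo) / nbins

  # Zero-based bin index. Values that sit exactly on an upper bin edge are
  # moved to the lower bin, except the minimum, which stays in bin 0.
  approx_bins <- (x - lo) / binsize
  on_edge <- approx_bins == floor(approx_bins) & x != lo
  bins <- floor(approx_bins) - ifelse(on_edge, 1, 0)

  # Intermediate object holding the derived column; df itself is not touched.
  binned <- data.frame(value = x, bins = bins)

  counts <- as.vector(table(factor(binned$bins, levels = 0:(nbins - 1))))
  centroids <- lo + (0:(nbins - 1) + 0.5) * binsize
  data.frame(counts = counts, centroids = centroids)
}

input <- data.frame(v = 0:10)
stats <- local_histogram(input, "v", nbins = 5)
```

In SparkR the analogous move would presumably be to build the binned DataFrame under a new variable name instead of assigning to `df$bins`.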





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212049730
  
@felixcheung Could you also take a look at this PR?





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60281327
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,100 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can 
be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", 
binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be when improved SPARK-9325 or SPARK-13436 
are fixed
+  eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset
+  df$x <- col
+  colname <- "x"
+  col
+}
+
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid 
rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
+
+# Since min = 0, max = 1 (data is already normalized)
+normBinSize <- 1 / nbins
+binsize <- (max - min) / nbins
+approxBins <- xnorm / normBinSize
+
+# Adjust values that are equal to the upper bound of each bin
+bins <- cast(approxBins -
+ ifelse(approxBins == cast(approxBins, "integer") 
& x != min, 1, 0),
+ "integer")
+
+df$bins <- bins
--- End diff --

Similar question as above. I'm wondering if there is a better way than 
adding `bins` as a column to the input DataFrame. Ideally, as a user I would 
assume that `histogram` is a safe function, in that it doesn't mutate the input 
data given to it. I am not sure what an easy solution is here, though.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60280663
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,100 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can 
be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", 
binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be when improved SPARK-9325 or SPARK-13436 
are fixed
+  eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset
+  df$x <- col
--- End diff --

In that case, can we check whether this column is already present in the 
dataframe and not add it if so? I just don't want to keep adding spurious 
columns if somebody keeps calling `histogram(df$average)` repeatedly.
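
A related guard, sketched in base R under the assumption that a temporary column is still needed: pick a name that does not collide with the existing columns, so a user column that happens to be called `x` is never overwritten. `temp_col_name` is a hypothetical helper and not part of SparkR.

```r
# Hypothetical helper: return `base` if it is unused, otherwise the first
# free name of the form base_1, base_2, ...
temp_col_name <- function(existing, base = "x") {
  candidate <- base
  i <- 0
  while (candidate %in% existing) {
    i <- i + 1
    candidate <- paste0(base, "_", i)
  }
  candidate
}

free_name <- temp_col_name(c("Sepal_Length", "Species"))   # "x"
next_name <- temp_col_name(c("x", "x_1", "Species"))       # "x_2"
```

In SparkR the `existing` argument would presumably come from `names(df)`, which the diff above already uses for the colname check.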





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60278988
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,100 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can 
be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", 
binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be when improved SPARK-9325 or SPARK-13436 
are fixed
+  eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset
+  df$x <- col
--- End diff --

@shivaram That's because the user could do:

`histogram(irisDF,  irisDF$Sepal_Length + 1, nbins=12)`

In that case, the given Column doesn't belong to the DataFrame. This gives 
the user a lot of flexibility and an R-like feel.
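
The same idea has a local analogue in base R, sketched here only to illustrate the two input forms: a column name is looked up in the data, while a derived vector (standing in for a Column expression such as `irisDF$Sepal_Length + 1`) is used as given. `resolve_values` is a hypothetical name, and the sketch uses R's built-in `iris` data.frame, not a SparkR DataFrame.

```r
# Hypothetical local analogue: `col` is either a single column name or a
# numeric vector derived from the data.
resolve_values <- function(df, col) {
  if (is.character(col) && length(col) == 1) {
    if (!col %in% names(df)) {
      stop("Specified colname does not belong to the given data.frame.")
    }
    df[[col]]
  } else if (is.numeric(col)) {
    col
  } else {
    stop("col must be a column name or a numeric vector.")
  }
}

by_name <- resolve_values(iris, "Sepal.Length")
by_expr <- resolve_values(iris, iris$Sepal.Length + 1)
```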





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60278119
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,100 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can 
be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", 
binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column
+  df <- na.omit(df[, colname])
--- End diff --

If we are filtering NA values, shouldn't we also do it for the other case?
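
A minimal base R sketch of that symmetry, assuming the values are resolved first and filtered afterwards, so that both input forms go through the same NA handling. `values_without_na` is a hypothetical name, and a local data.frame stands in for the SparkR DataFrame.

```r
# Hypothetical local analogue: drop NA values once, after `col` has been
# resolved, so a column name and a derived vector are treated alike.
values_without_na <- function(df, col) {
  values <- if (is.character(col)) df[[col]] else col
  values[!is.na(values)]
}

df <- data.frame(v = c(1, NA, 3))
from_name <- values_without_na(df, "v")        # 1 3
from_expr <- values_without_na(df, df$v * 2)   # 2 6
```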





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60278057
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,100 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(df, "colname"Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can 
be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histStats, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histStats, stat = "identity", 
binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be when improved SPARK-9325 or SPARK-13436 
are fixed
+  eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset
+  df$x <- col
--- End diff --

I'm not sure why we need to add the column to the dataframe. Isn't it 
already a part of the dataframe?





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60277269
  
--- Diff: R/pkg/DESCRIPTION ---
@@ -36,3 +36,4 @@ Collate:
 'stats.R'
 'types.R'
 'utils.R'
+RoxygenNote: 5.0.1
--- End diff --

I think this might have been auto-generated by roxygen. Can we revert this 
file for this PR?





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-12 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-209030552
  
@felixcheung @shivaram This is done. Shall we merge?





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-30 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-203678252
  
Looks good to you @felixcheung?





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-201607130
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-201607135
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54237/
Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-201606853
  
**[Test build #54237 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54237/consoleFull)** for PR 11569 at commit [`2800492`](https://github.com/apache/spark/commit/2800492e307253d7f6004944b2e4beb11f76c330).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-25 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-201597000
  
@felixcheung I just added support for Columns or characters. 

In my opinion, histogram() is a column function, and once we sort out #11336 
it wouldn't have to take a DataFrame as a parameter.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-201596039
  
**[Test build #54237 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54237/consoleFull)** for PR 11569 at commit [`2800492`](https://github.com/apache/spark/commit/2800492e307253d7f6004944b2e4beb11f76c330).





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-201594759
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54234/
Test FAILed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-201594744
  
**[Test build #54234 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54234/consoleFull)** for PR 11569 at commit [`19f995c`](https://github.com/apache/spark/commit/19f995c8f72efb58b818f81a122529680c24f5ec).
 * This patch **fails R style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-201594755
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-201593441
  
**[Test build #54234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54234/consoleFull)** for PR 11569 at commit [`19f995c`](https://github.com/apache/spark/commit/19f995c8f72efb58b818f81a122529680c24f5ec).





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-24 Thread felixcheung
Github user felixcheung commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-200947392
  
btw, is this still the right place? functions.R works with Column; if this 
works with DataFrame, should it go in DataFrame.R?





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-200928627
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54051/
Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-200928622
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-200928468
  
**[Test build #54051 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54051/consoleFull)** for PR 11569 at commit [`dbc9d75`](https://github.com/apache/spark/commit/dbc9d75584b7ae3ab952c7a88b19b21ad00a5a82).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-24 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r57349827
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,81 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). The default is 10.
+#' @param df the DataFrame containing the Column to build the histogram from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @examples \dontrun{
+#' 
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, it would be very easy to
+#' # render the histogram using R's visualization packages such as ggplot2.
+#'   
+#' } 
+setMethod("histogram",
+          signature(df = "DataFrame"),
+          function(df, colname, nbins = 10) {
+            # Validate nbins
+            if (nbins < 2) {
+              stop("The number of bins must be a positive integer number greater than 1.")
--- End diff --

Done!





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-200919208
  
**[Test build #54051 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54051/consoleFull)** for PR 11569 at commit [`dbc9d75`](https://github.com/apache/spark/commit/dbc9d75584b7ae3ab952c7a88b19b21ad00a5a82).




