[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns

2017-10-30 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
@shivaram: Have you reviewed this? If the intent is to merge it, I'll 
gladly update the code. @gatorsmile 





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns

2017-01-11 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
Happy New Year, folks! Any updates on this? @shivaram @falaki





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns

2016-12-14 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
@falaki @shivaram Shall we merge this?





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns

2016-11-16 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
@shivaram @felixcheung @falaki Any thoughts?





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns

2016-11-08 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
Folks?





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns

2016-11-02 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
@shivaram @falaki @felixcheung Any additional comments? Otherwise, are we 
ready to merge?





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns

2016-10-25 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
@felixcheung I'm done! Thanks for your comments!





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns

2016-10-25 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
@felixcheung When a user types a variable name at the R shell, it triggers the 
method showDefault(), which in turn invokes show(). I wrote an implementation 
of show() for Column that in turn invokes head() (not collect()), showing the 
first 20 elements of the dataset. This mimics R's behavior and I think it also 
helps usability. However, if the agreement is not to have that, I can just 
remove the show() method.
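
A self-contained sketch of the dispatch described above; the class is a 
stand-in for SparkR's Column, and only the show()-delegates-to-head() behavior 
is taken from the comment:

```
# Sketch: an S4 class whose show() method delegates to head(), so typing a
# variable's name at the prompt prints only the first 20 elements.
setClass("ColumnLike", representation(values = "numeric"))

setMethod("show", signature("ColumnLike"), function(object) {
  print(head(object@values, 20L))  # head(), not a full collect of the data
})

col <- new("ColumnLike", values = rnorm(1000))
col  # auto-printing dispatches to show(), which prints the first 20 values
```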





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns

2016-10-25 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
@felixcheung @falaki Folks?





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] head() and show() for Columns

2016-10-21 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
@felixcheung @falaki I have addressed all your comments. Shall we merge?





[GitHub] spark pull request #11336: [SPARK-9325][SPARK-R] head() and show() for Colum...

2016-10-19 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11336#discussion_r84150590
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3321,3 +3328,11 @@ setMethod("randomSplit",
 }
 sapply(sdfs, dataFrame)
   })
+
+# A global singleton for an empty SparkR DataFrame.
+getEmptySparkRDataFrame <- function() {
+  if (is.null(.sparkREnv$EMPTY_DF)) {
+.sparkREnv$EMPTY_DF <- as.DataFrame(data.frame(0))
+  }
+  return(.sparkREnv$EMPTY_DF)
--- End diff --

get() would throw an error if the variable is not defined. I'll use exists() instead.
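
A minimal sketch of the guarded singleton along these lines; the plain 
data.frame stands in for as.DataFrame(data.frame(0)) so the snippet runs on 
its own:

```
.sparkREnv <- new.env()  # stand-in for SparkR's package environment

getEmptySparkRDataFrame <- function() {
  # exists() returns FALSE for an undefined name, whereas get() would error
  if (!exists("EMPTY_DF", envir = .sparkREnv, inherits = FALSE)) {
    assign("EMPTY_DF", data.frame(0), envir = .sparkREnv)
  }
  get("EMPTY_DF", envir = .sparkREnv)
}
```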





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...

2016-10-14 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
@felixcheung @falaki I have addressed all your comments and tests pass now. 
Thank you! cc @aloknsingh 





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...

2016-10-12 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
@falaki @felixcheung I have addressed all your comments. I'm getting two 
documentation warnings which seem to be making the build fail:

1)
```
Undocumented S4 methods:
  generic 'head' and siglist 'Column'
  generic 'show' and siglist 'Column'
```
The documentation for these is in DataFrame.R. I don't see a need to 
duplicate the docs in column.R.


2)
```
Undocumented arguments in documentation object 'Column-class'
  'df'

Undocumented arguments in documentation object 'head'
  '...'
```
I do have documentation for slot df in class Column. Also, `...` is not part of 
the signature of my head() method, so I'm not sure why this warning comes up.
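
The usual roxygen2 fix for both warnings, sketched below, is to attach the 
Column methods to the shared topic with @rdname and document the dots 
explicitly; whether the PR settled on this is not shown in the thread, and the 
method bodies here are placeholders:

```
#' @rdname head
#' @param ... further arguments passed to or from other methods.
setMethod("head", signature(x = "Column"), function(x, ...) {
  stop("placeholder body; the real implementation is in the PR")
})

#' @rdname show
setMethod("show", signature(object = "Column"), function(object) {
  stop("placeholder body; the real implementation is in the PR")
})
```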





[GitHub] spark pull request #11336: [SPARK-9325][SPARK-R] collect() head() and show()...

2016-10-12 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11336#discussion_r83057347
  
--- Diff: R/pkg/R/functions.R ---
@@ -2836,7 +2845,11 @@ setMethod("lpad", signature(x = "Column", len = "numeric", pad = "character"),
 setMethod("rand", signature(seed = "missing"),
   function(seed) {
 jc <- callJStatic("org.apache.spark.sql.functions", "rand")
-column(jc)
+
+# By assigning a one-row data.frame, the result of this function can be collected
+# returning a one-element Column
+df <- as.DataFrame(sparkRSQL.init(), data.frame(0))
--- End diff --

@felixcheung That's a good idea. I have created a singleton accordingly.





[GitHub] spark pull request #11336: [SPARK-9325][SPARK-R] collect() head() and show()...

2016-10-11 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11336#discussion_r82919122
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1168,12 +1179,14 @@ setMethod("take",
 
 #' Head
 #'
-#' Return the first \code{num} rows of a SparkDataFrame as a R data.frame. If \code{num} is not
-#' specified, then head() returns the first 6 rows as with R data.frame.
+#' Return the first elements of a dataset. If \code{x} is a SparkDataFrame, its first
+#' rows will be returned as a data.frame. If the dataset is a \code{Column}, its first
+#' elements will be returned as a vector. The number of elements to be returned
+#' is given by parameter \code{num}. Default value for \code{num} is 6.
 #'
-#' @param x a SparkDataFrame.
-#' @param num the number of rows to return. Default is 6.
-#' @return A data.frame.
+#' @param x A SparkDataFrame or Column
--- End diff --

Not sure I follow here. Could you point to the specific example?





[GitHub] spark pull request #11336: [SPARK-9325][SPARK-R] collect() head() and show()...

2016-10-11 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11336#discussion_r82918200
  
--- Diff: R/pkg/R/column.R ---
@@ -32,35 +34,57 @@ setOldClass("jobj")
 #' @export
 #' @note Column since 1.4.0
 setClass("Column",
- slots = list(jc = "jobj"))
+ slots = list(jc = "jobj", df = "SparkDataFrameOrNull"))
 
 #' A set of operations working with SparkDataFrame columns
 #' @rdname columnfunctions
 #' @name columnfunctions
 NULL
-
-setMethod("initialize", "Column", function(.Object, jc) {
+setMethod("initialize", "Column", function(.Object, jc, df) {
   .Object@jc <- jc
+
+  # Some Column objects don't have any referencing DataFrame. In such case, df will be NULL.
+  if (missing(df)) {
+df <- NULL
+  }
+  .Object@df <- df
   .Object
 })
 
+setMethod("show", signature = "Column", definition = function(object) {
--- End diff --

Sure





[GitHub] spark pull request #11336: [SPARK-9325][SPARK-R] collect() head() and show()...

2016-10-11 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11336#discussion_r82911261
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1049,11 +1055,16 @@ setMethod("dim",
 #' @export
 #' @examples
 #'\dontrun{
-#' sparkR.session()
-#' path <- "path/to/file.json"
-#' df <- read.json(path)
-#' collected <- collect(df)
-#' firstName <- collected[[1]]$name
+#' # Initialize Spark context and SQL context
+#' sc <- sparkR.init()
+#' sqlContext <- sparkRSQL.init(sc)
--- End diff --

Sure. Thanks!





[GitHub] spark pull request #11336: [SPARK-9325][SPARK-R] collect() head() and show()...

2016-10-11 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11336#discussion_r82911244
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1035,10 +1035,16 @@ setMethod("dim",
 c(count(x), ncol(x))
   })
 
-#' Collects all the elements of a SparkDataFrame and coerces them into an 
R data.frame.
+#' Download Spark datasets into R
--- End diff --

Sure. Thanks!





[GitHub] spark pull request #11336: [SPARK-9325][SPARK-R] collect() head() and show()...

2016-10-11 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11336#discussion_r82910308
  
--- Diff: R/pkg/R/functions.R ---
@@ -2836,7 +2845,11 @@ setMethod("lpad", signature(x = "Column", len = "numeric", pad = "character"),
 setMethod("rand", signature(seed = "missing"),
   function(seed) {
 jc <- callJStatic("org.apache.spark.sql.functions", "rand")
-column(jc)
+
+# By assigning a one-row data.frame, the result of this function can be collected
+# returning a one-element Column
+df <- as.DataFrame(sparkRSQL.init(), data.frame(0))
--- End diff --

See my comment from March 30 to illustrate why this is needed. I'll change 
sparkRSQL.init() to sparkR.session(). Thanks for catching this!
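
A sketch of that change under the Spark 2.x API, where the session is created 
once and as.DataFrame() no longer takes a context argument:

```
# Deprecated 1.x style, which triggers the warning quoted in the build logs:
#   df <- as.DataFrame(sparkRSQL.init(), data.frame(0))

# 2.x style:
sparkR.session()                  # creates or reuses the SparkSession
df <- as.DataFrame(data.frame(0)) # one-row backing DataFrame, as in the diff
```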





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...

2016-10-11 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
@falaki Yeah, but those warnings are making the build fail (see below). Is 
that okay? I also see a new "Checks" section now. I may be out of date on the 
protocols, as it's been a while since I last committed :-). Thanks!

```
Failed 
-
1. Error: column binary mathfunctions (@test_sparkSQL.R#1256) 
--
error in evaluating the argument 'x' in selecting a method for function 
'collect': 
  error in evaluating the argument 'col' in selecting a method for function 
'select': (converted from warning) 'sparkRSQL.init' is deprecated.
Use 'sparkR.session' instead.
See help("Deprecated")
1: expect_equal(class(collect(select(df, rand()))[2, 1]), "numeric") at 
/home/jenkins/workspace/SparkPullRequestBuilder/R/lib/SparkR/tests/testthat/test_sparkSQL.R:1256
2: compare(object, expected, ...)
3: collect(select(df, rand()))
```





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...

2016-10-11 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
@falaki Thanks for your comments. Yeah, before removing collect/show, I 
just wanted to rebase onto the current upstream master. I'm getting a build 
error that is actually a warning, not even an R error:

```
(converted from warning) 'sparkRSQL.init' is deprecated.
Use 'sparkR.session' instead.
```

I don't explicitly use sparkRSQL.init anywhere in my code, so I'm 
investigating. Any suggestions would be appreciated. Thanks!





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...

2016-10-11 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
@falaki Sorry, I was out of town. Let me get back to this today. Thank you!





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...

2016-10-06 Thread olarayej
Github user olarayej commented on the issue:

https://github.com/apache/spark/pull/11336
  
@falaki Absolutely. Let me do that. Thank you!





[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-05-06 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11336#issuecomment-217586684
  
I just got this branch up to date. Any comments, folks? @shivaram @falaki 
@felixcheung @rxin @sun-rui @mengxr 





[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...

2016-04-27 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11318#issuecomment-215253259
  
@shivaram I've changed the default value to drop = FALSE as you suggested. Thanks!





[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...

2016-04-27 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11318#issuecomment-215181464
  
@shivaram @sun-rui @felixcheung This one's ready. Shall we merge?





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214898243
  
@shivaram @felixcheung I have addressed all your comments. Anything else? 
Or shall we merge? Thanks!





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r61168496
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2465,6 +2465,110 @@ setMethod("drop",
 base::drop(x)
   })
 
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family DataFrame functions
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histData, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histData, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column and remove all other 
columns
+  df <- na.omit(df[, colname])
+  getColumn(df, colname)
+
+} else if (class(col) == "Column") {
+
+  # Append the given column to the dataset. This is to support 
Columns that
+  # don't belong to the DataFrame but are rather expressions
+  df$x <- col
--- End diff --

@shivaram Yes, I have fixed this. Thanks!





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-26 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-214890244
  
@shivaram @felixcheung Looks like the version of lint-r running on the 
build server is different from the one in Spark's GitHub repo. Even though 
lint-r passes on my local machine, I keep getting these errors:

R/DataFrame.R:2542:40: style: Put spaces around all infix operators.
   collapse=""






[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-25 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r61009021
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2469,6 +2469,110 @@ setMethod("drop",
 base::drop(x)
   })
 
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family DataFrame functions
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histData, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histData, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
--- End diff --

Yeah, let me deliver that with Shivaram's fix as well. Thanks!





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-22 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213528826
  
@felixcheung @shivaram I'm done with all your suggestions. Thanks. Shall we 
merge?





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-213124256
  
@felixcheung @shivaram I have addressed all your comments. Thanks!





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60651721
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,107 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histData, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histData, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column and remove all other 
columns
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be improved when SPARK-9325 or SPARK-13436 are fixed
+  getColumn(df, colname)
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset. This is to support 
Columns that
+  # don't belong to the DataFrame but are rather expressions
+  df$x <- col
+
+  # Filter NA values in the target column. Cannot remove all 
other columns
+  # since given Column may be an expression on one or more 
existing columns
+  df <- na.omit(df)
+
+  colname <- "x"
+  col
+}
+
+# At this point, df only has one column: the one to compute 
the histogram from
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid 
rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
--- End diff --

Yeah, let me move it to DataFrame.R





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-21 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60651656
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,107 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histData, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histData, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column and remove all other 
columns
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be improved when SPARK-9325 or SPARK-13436 are fixed
+  getColumn(df, colname)
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset. This is to support 
Columns that
+  # don't belong to the DataFrame but are rather expressions
+  df$x <- col
+
+  # Filter NA values in the target column. Cannot remove all 
other columns
+  # since given Column may be an expression on one or more 
existing columns
+  df <- na.omit(df)
+
+  colname <- "x"
+  col
+}
+
+# At this point, df only has one column: the one to compute 
the histogram from
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid 
rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
--- End diff --

@felixcheung That would never be the case since I'm normalizing the data to 
be within [0, 1]. Line 2715:

`  xnorm <- (x - min) / (max - min)`





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-20 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60452619
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,100 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histData, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histData, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be improved when SPARK-9325 or SPARK-13436 are fixed
+  eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset
+  df$x <- col
+  colname <- "x"
+  col
+}
+
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid 
rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
+
+# Since min = 0, max = 1 (data is already normalized)
+normBinSize <- 1 / nbins
+binsize <- (max - min) / nbins
+approxBins <- xnorm / normBinSize
+
+# Adjust values that are equal to the upper bound of each bin
+bins <- cast(approxBins -
+ ifelse(approxBins == cast(approxBins, "integer") 
& x != min, 1, 0),
+ "integer")
+
+df$bins <- bins
--- End diff --

@felixcheung I need to remove NA values from Column `x` too, since `x` could 
be an arbitrary Column expression. Therefore, the `na.omit()` invocation should 
go afterwards.
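
A sketch of the ordering being described, with an assumed column expression; 
appending the expression first means na.omit() also drops rows where the 
expression itself evaluates to NA:

```
# Sketch (SparkR API; df is an existing SparkDataFrame):
df$x <- df$Sepal_Length + 1  # arbitrary Column expression, may introduce NAs
df <- na.omit(df)            # must run after the assignment above
```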





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-212141260
  
@shivaram @felixcheung I have addressed all your comments. Thank you!





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60287324
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,100 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histData, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histData, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be improved when SPARK-9325 or SPARK-13436 are fixed
+  eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset
+  df$x <- col
+  colname <- "x"
+  col
+}
+
+stats <- collect(describe(df[, colname]))
+min <- as.numeric(stats[4, 2])
+max <- as.numeric(stats[5, 2])
+
+# Normalize the data
+xnorm <- (x - min) / (max - min)
+
+# Round the data to 4 significant digits. This is to avoid 
rounding issues.
+xnorm <- cast(xnorm * 1, "integer") / 1.0
+
+# Since min = 0, max = 1 (data is already normalized)
+normBinSize <- 1 / nbins
+binsize <- (max - min) / nbins
+approxBins <- xnorm / normBinSize
+
+# Adjust values that are equal to the upper bound of each bin
+bins <- cast(approxBins -
+ ifelse(approxBins == cast(approxBins, "integer") 
& x != min, 1, 0),
+ "integer")
+
+df$bins <- bins
--- End diff --

@shivaram @felixcheung The original DataFrame is NOT being mutated. As a 
matter of fact, R doesn't support passing parameters by reference, so 
effectively a new copy of the DataFrame is created every time this function is 
invoked. This should, in turn, trigger the creation of a new corresponding 
Java object.

To illustrate this, notice the dataset doesn't change after running 
histogram():
```
> str(irisDF)
'DataFrame': 5 variables:
 $ Sepal_Length: num 5.1 4.9 4.7 4.6 5 5.4
 $ Sepal_Width : num 3.5 3 3.2 3.1 3.6 3.9
 $ Petal_Length: num 1.4 1.4 1.3 1.5 1.4 1.7
 $ Petal_Width : num 0.2 0.2 0.2 0.2 0.2 0.4
 $ Species : chr "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"

> histogram(irisDF, irisDF$Sepal_Lengt

[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60284165
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,100 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histData, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histData, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column
+  df <- na.omit(df[, colname])
--- End diff --

Yes. Let me fix that.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-19 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r60278988
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,100 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). Default value is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @rdname histogram
+#' @family agg_funcs
+#' @export
+#' @examples 
+#' \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, the histogram can be
+#' # rendered using the ggplot2 library:
+#'
+#' require(ggplot2)
+#' plot <- ggplot(histData, aes(x = centroids, y = counts))
+#' plot <- plot + geom_histogram(data = histData, stat = "identity", binwidth = 100)
+#' plot <- plot + xlab("Sepal_Length") + ylab("Frequency")
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame", col = "characterOrColumn"),
+  function(df, col, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
+}
+
+# Round nbins to the smallest integer
+nbins <- floor(nbins)
+
+# Validate col
+if (is.null(col)) {
+  stop("col must be specified.")
+}
+
+colname <- col
+x <- if (class(col) == "character") {
+  if (!colname %in% names(df)) {
+stop("Specified colname does not belong to the given 
DataFrame.")
+  }
+
+  # Filter NA values in the target column
+  df <- na.omit(df[, colname])
+
+  # TODO: This will be improved when SPARK-9325 or SPARK-13436 are fixed
+  eval(parse(text = paste0("df$", colname)))
+} else if (class(col) == "Column") {
+  # Append the given column to the dataset
+  df$x <- col
--- End diff --

@shivaram That's because the user could do:

`histogram(irisDF, irisDF$Sepal_Length + 1, nbins = 12)`

In that case, the given Column doesn't belong to the DataFrame. This gives 
the user a lot of flexibility and an R-like feel.





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-04-12 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-209030552
  
@felixcheung @shivaram This is done. Shall we merge?





[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-04-12 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11336#issuecomment-209030229
  
@shivaram @falaki @felixcheung @rxin @sun-rui Any thoughts on this?





[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-30 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-203678252
  
Looks good to you @felixcheung?





[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...

2016-03-30 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11318#issuecomment-203678123
  
@sun-rui @shivaram Shall we merge this?





[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-03-30 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11336#issuecomment-203672948
  
Thanks @sun-rui @rxin @shivaram for your input. To alleviate the confusion 
about which columns can and cannot be collected, I propose the following 
(I've already pushed the code):

Currently there are 15 SparkR functions that return an ‘orphan’ Column 
with no parent DataFrame:
```
rand, randn, unix_timestamp,
struct, expr, column, lag, lead, lit, cume_dist, dense_rank,
ntile, percent_rank, rank, row_number
```
The first three (i.e., rand, randn, and unix_timestamp) can be nicely 
collected as single elements. For example:
```
> rand()
[1] 0.01483325
```
The remaining ones don’t make sense unless there’s an associated 
DataFrame. Therefore, an empty vector will be returned:
```
> column("Species")
Species


> collect(column("Species"))
character(0)
```

I think it makes sense: If you don’t associate a Column with a DataFrame, 
there’s nothing to be collected. Now, for Columns that do belong to a 
DataFrame, collecting columns SIGNIFICANTLY improves usability in 138 
functions/operators (besides other issues in the design document), for example:

```
> irisDF$Sepal_Length * 100
 [1] 510 490 470 460 500 540 460 500 440 490 540 480 480 430 580 570 540 510 570 510…
```

versus:

```
> head(select(irisDF, irisDF$Sepal_Length * 100), 20)[, 1]
 [1] 510 490 470 460 500 540 460 500 440 490 540 480 480 430 580 570 540 510 570 510
```

@shivaram has a very valid point: this introduces discrepancies in the 
Spark APIs across languages. I believe that is not necessarily bad, as R in 
particular is a slightly different animal, one that already has a specific 
behavior for columns (i.e., vectors).
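
To make the proposed rule concrete, here is a self-contained sketch in plain 
R, using a toy class rather than SparkR's actual Column implementation (the 
parent slot is an assumption for illustration only):

```
# Toy stand-in for a Column; SparkR's real class has no such parent slot
setClass("ToyColumn", representation(name = "character", parent = "ANY"))

collectToy <- function(col) {
  if (is.null(col@parent)) {
    character(0)             # orphan column: nothing to collect
  } else {
    col@parent[[col@name]]   # values pulled from the parent table
  }
}

orphan <- new("ToyColumn", name = "Species", parent = NULL)
bound  <- new("ToyColumn", name = "Species", parent = iris)

collectToy(orphan)        # character(0)
head(collectToy(bound))   # first few Species values
```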


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-25 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-201597000
  
@felixcheung I just added support for Columns or characters. 

In my opinion, histogram() is a column function, and once we sort out #11336 
it wouldn't have to take a DataFrame as a parameter.
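
For illustration, the envisioned Column-only call (hypothetical until #11336 
is resolved) would read:

```
histogram(irisDF$Sepal_Length, nbins = 12)
```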


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-24 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r57349827
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,81 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). The default is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @examples \dontrun{
+#' 
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, it would be very 
easy to
+#' # render the histogram using R's visualization packages such as ggplot2.
+#'   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame"),
+  function(df, colname, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
--- End diff --

Done!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-24 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r57343915
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,81 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). The default is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @examples \dontrun{
+#' 
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, it would be very 
easy to
+#' # render the histogram using R's visualization packages such as ggplot2.
+#'   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame"),
+  function(df, colname, nbins = 10) {
--- End diff --

Yeah, but then what if the user wants to do:

`hist(irisDF, irisDF$Sepal_Length + 1)`

describe() would fail as this column doesn't belong to df.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-24 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r57342255
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,81 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). The default is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @examples \dontrun{
+#' 
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, it would be very 
easy to
+#' # render the histogram using R's visualization packages such as ggplot2.
+#'   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame"),
+  function(df, colname, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
--- End diff --

R doesn't (see below). Let me change my code so it rounds nbins down to the 
nearest integer.

```
> str(hist(iris$Sepal.Length, breaks=10.5))
List of 6
 $ breaks  : num [1:9] 4 4.5 5 5.5 6 6.5 7 7.5 8
 $ counts  : int [1:8] 5 27 27 30 31 18 6 6
 $ density : num [1:8] 0.0667 0.36 0.36 0.4 0.4133 ...
 $ mids    : num [1:8] 4.25 4.75 5.25 5.75 6.25 6.75 7.25 7.75
 $ xname   : chr "iris$Sepal.Length"
 $ equidist: logi TRUE
```
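
For illustration, a minimal sketch of that rounding (an assumption about the 
eventual change, not the merged code):

```
# Hypothetical helper: floor a fractional bin count, then validate it
validateNbins <- function(nbins) {
  nbins <- as.integer(floor(nbins))
  if (is.na(nbins) || nbins < 2) {
    stop("The number of bins must be an integer greater than 1.")
  }
  nbins
}

validateNbins(10.5)   # 10
validateNbins(2)      # 2
```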


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-23 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r57259854
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,81 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). The default is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @examples \dontrun{
+#' 
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, it would be very 
easy to
+#' # render the histogram using R's visualization packages such as ggplot2.
+#'   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame"),
+  function(df, colname, nbins = 10) {
--- End diff --

@felixcheung Yeah, I thought of that but I don't know how to compute the 
min and max in one single pass given a Column (not a name). I used describe() 
which requires a column name. I also tried agg, but it cannot compute more than 
1 stat per column:

```
> collect(agg(irisDF, Sepal_Length="max", Sepal_Width="min"))
  max(Sepal_Length) min(Sepal_Width)
1               7.9                2

> collect(agg(irisDF, Sepal_Length="max", Sepal_Length="min"))
  max(Sepal_Length)
1               7.9
```

Suggestions?
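
For reference, one possible workaround, assuming the Column-expression form 
of agg() (which takes Column objects rather than name-to-function pairs) 
accepts two aggregates over the same column:

```
> collect(agg(irisDF, min(irisDF$Sepal_Length), max(irisDF$Sepal_Length)))
  min(Sepal_Length) max(Sepal_Length)
1               4.3               7.9
```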


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-23 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11569#discussion_r57252393
  
--- Diff: R/pkg/R/functions.R ---
@@ -2638,3 +2638,81 @@ setMethod("sort_array",
 jc <- callJStatic("org.apache.spark.sql.functions", 
"sort_array", x@jc, asc)
 column(jc)
   })
+
+#' This function computes a histogram for a given SparkR Column.
+#' 
+#' @name histogram
+#' @title Histogram
+#' @param nbins the number of bins (optional). The default is 10.
+#' @param df the DataFrame containing the Column to build the histogram 
from.
+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and 
centroids.
+#' @examples \dontrun{
+#' 
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Compute histogram statistics
+#' histData <- histogram(irisDF, "Sepal_Length", nbins = 12)
+#'
+#' # Once SparkR has computed the histogram statistics, it would be very 
easy to
+#' # render the histogram using R's visualization packages such as ggplot2.
+#'   
+#' } 
+setMethod("histogram",
+  signature(df = "DataFrame"),
+  function(df, colname, nbins = 10) {
+# Validate nbins
+if (nbins < 2) {
+  stop("The number of bins must be a positive integer number 
greater than 1.")
--- End diff --

Yup, you could have a histogram with 2 bins:

```
> histogram(irisDF, "Sepal_Length", nbins=2)
  bins counts centroids
1    0     95       5.2
2    1     55       7.0
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13734][SPARKR] Added histogram function

2016-03-21 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11569#issuecomment-199552264
  
@shivaram @sun-rui @felixcheung Yeah, that makes sense. I modified 
histogram() function so now it only computes the histogram statistics. There's 
neither rendering nor dependency on ggplot2 anymore. I think the histogram 
stats are still very useful for an R user and if they wanna plot it later, 
they're free to use any of R packages.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...

2016-03-21 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11318#issuecomment-199464512
  
@sun-rui Done with the style issues. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...

2016-03-19 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11318#discussion_r56418618
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1217,29 +1217,38 @@ setMethod("[[", signature(x = "DataFrame", i = 
"numericOrcharacter"),
 
 #' @rdname subset
--- End diff --

Done!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...

2016-03-19 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11318#issuecomment-197566393
  
@felixcheung @shivaram @sun-rui I have addressed all your comments. Do we 
have a consensus on the default value for drop? I'd say drop=T makes sense 
because that's how R behaves. Anyway, it's just a one-line change for me. 
Please let me know! Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...

2016-03-10 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11220#issuecomment-195127575
  
Thanks, @shivaram!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...

2016-03-08 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11220#discussion_r55481841
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -303,8 +303,28 @@ setMethod("colnames",
 #' @rdname columns
 #' @name colnames<-
 setMethod("colnames<-",
-  signature(x = "DataFrame", value = "character"),
--- End diff --

Thanks @felixcheung for investigating this further!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...

2016-03-08 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11318#issuecomment-194128322
  
@felixcheung @sun-rui @shivaram Can you folks please take a look at this 
one? Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...

2016-03-08 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11220#discussion_r55418947
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -303,8 +303,28 @@ setMethod("colnames",
 #' @rdname columns
 #' @name colnames<-
 setMethod("colnames<-",
-  signature(x = "DataFrame", value = "character"),
+  signature(x = "DataFrame"),
   function(x, value) {
+
+# Check parameter integrity
+if (class(value) != "character") {
+  stop("Invalid column names.")
+}
+
+if (length(value) != ncol(x)) {
+  stop(
+"Column names must have the same length as the number of 
columns in the dataset.")
+}
+
+if (any(is.na(value))) {
+  stop("Column names cannot be NA.")
+}
+
+# Check if the column names have . in it
+if (any(regexec(".", value, fixed=TRUE)[[1]][1] != -1)) {
--- End diff --

@sun-rui @felixcheung @shivaram Folks: this is a really simple thing. Shall 
we merge it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Added histogram function

2016-03-07 Thread olarayej
GitHub user olarayej opened a pull request:

https://github.com/apache/spark/pull/11569

Added histogram function

## What changes were proposed in this pull request?

Added method histogram() to compute the histogram of a Column

**Usage:**
```
# Create a DataFrame from the Iris dataset
irisDF <- createDataFrame(sqlContext, iris)

# Render a histogram for the Sepal_Length column
histogram(irisDF, "Sepal_Length", nbins=12)
```

Note: Usage will change once SPARK-9325 is figured out so that histogram() 
only takes a Column as a parameter, as opposed to a DataFrame and a name

## How was this patch tested?

All unit tests pass. I added specific unit test cases for different scenarios.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/olarayej/spark SPARK-13734

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11569.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11569


commit efc2f6634b54cd91e4946d4d4e04be769769f4ad
Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.usca.ibm.com>
Date:   2016-03-08T00:19:07Z

Added histogram function




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-03-04 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11336#issuecomment-192400350
  
@sun-rui Does that make sense to you? @shivaram @felixcheung Any comments?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-03-02 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11336#issuecomment-191399711
  
Also, the fact that the size of a column depends on the join seems 
counter-intuitive for an R user:

```
> dim(irisDF2)
[1] 150   5

> dim(irisDF)
[1] 150   5

> x <- irisDF$Sepal_Length + irisDF2$Sepal_Length
```
In R, x will always have 150 elements. However:

```
# Cartesian product
> df3 <- join(irisDF, irisDF2)
> dim(select(df3, x))
[1] 22500     1

# Inner join by Species
> df4 <- merge(irisDF, irisDF2, by="Species")
> dim(select(df4, x))
[1] 7500    1

```
I still think SparkR shouldn't allow operations between columns coming from 
different DataFrames. And, in the case of a join, operations can be performed 
on the joined DataFrame (e.g., df3) as opposed to the original ones (e.g., 
irisDF and irisDF2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-03-01 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11336#issuecomment-191003387
  
@sun-rui Yes. In that case, c3 will only be associated with df3.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-03-01 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11336#issuecomment-190867380
  
SparkR doesn't support operations between columns from different DataFrame 
objects. Yet you can do:

```
c1 <- df1$c1
c2 <- df2$c2
c3 <- c1 + c2
```
c3 can't be used at all. See examples below:

```
## Create two DataFrames from Iris
> irisDF <- createDataFrame(sqlContext, iris)
> irisDF2 <- createDataFrame(sqlContext, iris)

## Create Column x, adding two Columns in two DataFrame's
> x <- irisDF$Sepal_Length + irisDF2$Sepal_Length

## You can't use Column x as a predicate
> irisDF[x > 0, ]
16/03/01 11:04:19 ERROR RBackendHandler: filter on 76 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  org.apache.spark.sql.AnalysisException: resolved attribute(s) 
Sepal_Length#20 missing from 
Sepal_Length#15,Petal_Width#18,Sepal_Width#16,Petal_Length#17,Species#19 in 
operator !Filter ((Sepal_Length#15 + Sepal_Length#20) > 0.0);

## You can't select Column x either
> select(irisDF, x)
16/03/01 11:04:43 ERROR RBackendHandler: select on 76 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  org.apache.spark.sql.AnalysisException: resolved attribute(s) 
Sepal_Length#20 missing from 
Sepal_Length#15,Petal_Width#18,Sepal_Width#16,Petal_Length#17,Species#19 in 
operator !Project [(Sepal_Length#15 + Sepal_Length#20) AS (Sepal_Length + 
Sepal_Length)#25];

> select(irisDF2, x)
16/03/01 11:04:45 ERROR RBackendHandler: select on 91 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  org.apache.spark.sql.AnalysisException: resolved attribute(s) 
Sepal_Length#15 missing from 
Sepal_Length#20,Sepal_Width#21,Species#24,Petal_Width#23,Petal_Length#22 in 
operator !Project [(Sepal_Length#15 + Sepal_Length#20) AS (Sepal_Length + 
Sepal_Length)#26];

```
In my opinion, we should throw an error if the user is trying to operate on 
Columns coming from different DataFrames; a sketch of such a guard follows.
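
A hypothetical sketch of the guard (the @parent slot is assumed for 
illustration; SparkR's Column does not currently track its originating 
DataFrame):

```
# Assumed-for-illustration: c1@parent / c2@parent hold the originating
# DataFrames. The arithmetic operators could call this before building
# the column expression.
checkSameParent <- function(c1, c2) {
  if (!identical(c1@parent, c2@parent)) {
    stop("Cannot operate on Columns that come from different DataFrames.")
  }
  invisible(TRUE)
}
```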


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-02-29 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11336#issuecomment-190334385
  
Can any of you folks please take a look at the code? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-02-25 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11336#issuecomment-189053702
  
@felixcheung It wasn't an R issue after all. The problem was that I hadn't 
been able to rebuild Spark in the last couple of days due to SPARK-13431, and 
I needed changes from SPARK-12799. Now that it’s fixed, everything runs fine 
on R 3.2.2. Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-02-24 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11336#issuecomment-188484496
  
Thanks, folks. Looks like all tests pass now! :-)

However, in my environment (R 3.2.2), two tests don't pass. We should be 
careful whenever upgrading the R version:

```
1. Failure (at test_sparkSQL.R#1052): column functions 
-
result not equal to expected
Names: 1 string mismatch

2. Failure (at test_sparkSQL.R#1058): column functions 
-
result not equal to expected
Names: 1 string mismatch
Error: Test failures
```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-02-24 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11336#issuecomment-188138539
  
@sun-rui @shivaram Do you know which versions of R and of SparkR's 
dependencies are being used in Jenkins? Tests run fine in my environment (I 
have reviewed my code and run the unit tests many times). I'm wondering if 
the failures are due to a different version of R, testthat, devtools, rJava, etc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-02-23 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11336#issuecomment-188046181
  
@AmplabJenkins Jenkins, could you retest please? I see ERROR: Error 
fetching remote repo 'origin'


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-02-23 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11336#issuecomment-187996241
  
Jenkins, retest please. All tests pass for me after checking out this 
branch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-02-23 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11336#issuecomment-187996142
  
Thanks @falaki!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9325][SPARK-R] collect() head() and sho...

2016-02-23 Thread olarayej
GitHub user olarayej opened a pull request:

https://github.com/apache/spark/pull/11336

[SPARK-9325][SPARK-R] collect() head() and show() for Columns

See attached design document

[SparkR collect (JIRA 
doc).pdf](https://github.com/apache/spark/files/143656/SparkR.collect.JIRA.doc.pdf)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/olarayej/spark SPARK-9325

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11336.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11336


commit 6fc97e02975909cb72e27077aac97d4f90b332d5
Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.usca.ibm.com>
Date:   2016-02-23T23:48:55Z

Support for collect() on Columns

commit fbf9b02b478b8eb4845232e09932d068cb393fd8
Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.usca.ibm.com>
Date:   2016-02-24T00:00:52Z

Removed drop=F from other PR




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13436][SPARKR]

2016-02-22 Thread olarayej
GitHub user olarayej opened a pull request:

https://github.com/apache/spark/pull/11318

[SPARK-13436][SPARKR]

## What changes were proposed in this pull request?

Added parameter drop to subsetting operator [. Refer to R's documentation 
for the behavior of parameter drop.

## How was this patch tested?

Ran all unit tests. Added drop=F to some of the tests where a DataFrame was 
required as opposed to a Column.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/olarayej/spark SPARK-13436

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11318.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11318


commit e8c64156e468105a4323f59d1ece87c8fb6662f4
Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.usca.ibm.com>
Date:   2016-02-16T18:14:40Z

Added parameter validations for colnames<-

commit 07e541b7e55a322ea7c74e230ee897ebe9584197
Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.attlocal.net>
Date:   2016-02-22T18:16:21Z

Added one test for replacing . with _ in column names assignment

commit 9fa2f5f13b0e6bb9a7ad9b53fc48c6694e49565a
Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.attlocal.net>
Date:   2016-02-23T02:54:50Z

Added drop parameter to subsetting operator. Rewrote [ as one single method

commit 632a81fe89b0f461b8084f6f2048046eb17c04b0
Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.attlocal.net>
Date:   2016-02-23T02:58:29Z

Merge branch 'master' of https://github.com/apache/spark into SPARK-13436




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13436] [Mesos] Document that spark.meso...

2016-02-22 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11311#issuecomment-187425038
  
@mgummelt I'm guessing you got the wrong PR number? Could you please fix 
this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...

2016-02-22 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11220#discussion_r53701739
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -303,8 +303,28 @@ setMethod("colnames",
 #' @rdname columns
 #' @name colnames<-
 setMethod("colnames<-",
-  signature(x = "DataFrame", value = "character"),
+  signature(x = "DataFrame"),
   function(x, value) {
+
+# Check parameter integrity
+if (class(value) != "character") {
+  stop("Invalid column names.")
+}
+
+if (length(value) != ncol(x)) {
+  stop(
+"Column names must have the same length as the number of 
columns in the dataset.")
+}
+
+if (any(is.na(value))) {
+  stop("Column names cannot be NA.")
+}
+
+# Check if the column names have . in it
+if (any(regexec(".", value, fixed=TRUE)[[1]][1] != -1)) {
--- End diff --

Done. @sun-rui, @felixcheung. Shall we merge this PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...

2016-02-22 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11220#issuecomment-187318737
  
Jenkins, retest please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...

2016-02-19 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11220#discussion_r53504008
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -303,8 +303,28 @@ setMethod("colnames",
 #' @rdname columns
 #' @name colnames<-
 setMethod("colnames<-",
-  signature(x = "DataFrame", value = "character"),
+  signature(x = "DataFrame"),
   function(x, value) {
+
+# Check parameter integrity
+if (class(value) != "character") {
+  stop("Invalid column names.")
+}
+
+if (length(value) != ncol(x)) {
+  stop(
+"Column names must have the same length as the number of 
columns in the dataset.")
+}
+
+if (any(is.na(value))) {
+  stop("Column names cannot be NA.")
+}
+
+# Check if the column names have . in it
+if (any(regexec(".", value, fixed=TRUE)[[1]][1] != -1)) {
--- End diff --

@felixcheung Not sure I follow your idea. Is this what you refer to?

```
# Note: if this test is broken, remove check for "." character on colnames<- method
expect_equal(colnames(irisDF)[1], "Sepal_Length")
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...

2016-02-18 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11220#discussion_r53375517
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -303,8 +303,28 @@ setMethod("colnames",
 #' @rdname columns
 #' @name colnames<-
 setMethod("colnames<-",
-  signature(x = "DataFrame", value = "character"),
--- End diff --

The reason I added this was because of this:

```
> colnames(iris) <- 1
Error in `colnames<-`(`*tmp*`, value = 1) : 
  'dimnames' applied to non-array
```

After I ran that, I saw this:

```
> showMethods("colnames<-")
Function: colnames<- (package SparkR)
x="ANY", value="ANY"
x="DataFrame", value="character"
x="DataFrame", value="numeric"
(inherited from: x="ANY", value="ANY")
```

So it looks like R automatically adds definitions of colnames<- if value is 
other than character.

This does not happen with coltypes<-, as it's not part of the base package 
and doesn't have an (ANY,ANY) signature:

```
> coltypes(irisDF) <- 1
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘coltypes<-’ for 
signature ‘"DataFrame", "numeric"’
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...

2016-02-18 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11220#discussion_r53373470
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -303,8 +303,28 @@ setMethod("colnames",
 #' @rdname columns
 #' @name colnames<-
 setMethod("colnames<-",
-  signature(x = "DataFrame", value = "character"),
+  signature(x = "DataFrame"),
   function(x, value) {
+
+# Check parameter integrity
+if (class(value) != "character") {
+  stop("Invalid column names.")
+}
+
+if (length(value) != ncol(x)) {
+  stop(
+"Column names must have the same length as the number of 
columns in the dataset.")
+}
+
+if (any(is.na(value))) {
+  stop("Column names cannot be NA.")
+}
+
+# Check if the column names have . in it
+if (any(regexec(".", value, fixed=TRUE)[[1]][1] != -1)) {
--- End diff --

@felixcheung @sun-rui Thanks for your input. Right now, if I assign column 
names containing the "." character, any subsequent operation on the DataFrame 
will fail.

Now, regarding @felixcheung's comment on the test case: there are already two 
test cases, with str() and with(), expecting the colnames of iris to be 
"Sepal_Length", etc. Those will break when SPARK-11976 is fixed. No 
need to add more. A corrected form of the quoted check is sketched below.
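
The quoted check indexes only the first name's match result, since 
regexec(...)[[1]] drops the rest of the vector; grepl() covers every 
element. A minimal sketch of the corrected check:

```
checkNoDots <- function(value) {
  # grepl() tests every element, unlike regexec(".", value, fixed=TRUE)[[1]],
  # which only looks at the match for the first name
  if (any(grepl(".", value, fixed = TRUE))) {
    stop("Column names cannot contain the '.' character.")
  }
  invisible(value)
}

checkNoDots(c("Sepal_Length", "Species"))   # passes silently
# checkNoDots("Sepal.Length")               # would stop with an error
```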


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...

2016-02-16 Thread olarayej
GitHub user olarayej opened a pull request:

https://github.com/apache/spark/pull/11220

[SPARK-13327][SPARKR] Added parameter validations for colnames<-



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/olarayej/spark SPARK-13312-3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11220.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11220


commit e8c64156e468105a4323f59d1ece87c8fb6662f4
Author: Oscar D. Lara Yejas <odlar...@oscars-mbp.usca.ibm.com>
Date:   2016-02-16T18:14:40Z

Added parameter validations for colnames<-




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2016-01-15 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/9613#issuecomment-172054570
  
Thanks, @shivaram!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2016-01-13 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/9613#issuecomment-171441589
  
@felixcheung Done. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2016-01-12 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/9613#discussion_r49493628
  
--- Diff: R/pkg/R/generics.R ---
@@ -581,6 +579,10 @@ setGeneric("unionAll", function(x, y) { 
standardGeneric("unionAll") })
 #' @export
 setGeneric("where", function(x, condition) { standardGeneric("where") })
 
+#' @rdname with
+#' @export
+setGeneric("with")
--- End diff --

Fixed this and also re-ordered the generic declarations for attach and 
as.data.frame.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2016-01-12 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/9613#issuecomment-171013162
  
Jenkins, could you retest please? The error I see is "Error fetching remote 
repo 'origin'"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2016-01-12 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/9613#issuecomment-171101731
  
@SparkQA Could you retest?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2016-01-11 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/9613#issuecomment-170624621
  
Happy New Year, folks! Shall we merge this PR? @shivaram 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-12-21 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/9613#issuecomment-166544706
  
@shivaram I have addressed all your comments. Should we close this pull 
request?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-12-18 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/9613#discussion_r48079567
  
--- Diff: R/pkg/R/generics.R ---
@@ -509,13 +520,8 @@ setGeneric("saveAsTable", function(df, tableName, 
source, mode, ...) {
   standardGeneric("saveAsTable")
 })
 
-#' @rdname withColumn
-#' @export
-setGeneric("transform", function(`_data`, ...) 
{standardGeneric("transform") })
--- End diff --

This has been fixed! Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-12-18 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/9613#issuecomment-165923895
  
@felixcheung Done!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-12-17 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/9613#issuecomment-165607267
  
@shivaram I have removed the caching logic as you indicated
@felixcheung @sun-rui I have already explained why we can't use R's str() 
function under the covers.
Any more comments? Otherwise, should we merge?
Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-12-17 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/9613#discussion_r47984121
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2151,3 +2151,97 @@ setMethod("coltypes",
 
 rTypes
   })
+
+#' Display the structure of a DataFrame, including column names, column 
types, as well as a
+#' a small sample of rows.
+#' @name str
+#' @title Compactly display the structure of a dataset
+#' @rdname str
+#' @family DataFrame functions
+#' @param object a DataFrame
+#' @examples \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Show the structure of the DataFrame
+#' str(irisDF)
+#' }
+setMethod("str",
+  signature(object = "DataFrame"),
+  function(object) {
+
+# TODO: These could be made global parameters, though in R 
it's not the case
+MAX_CHAR_PER_ROW <- 120
+MAX_COLS <- 100
+
+# Get the column names and types of the DataFrame
+names <- names(object)
+types <- coltypes(object)
+
+# Get the number of rows.
+# TODO: Ideally, this should be cached
+cachedCount <- nrow(object)
+
+# Get the first elements of the dataset. Limit number of 
columns accordingly
+localDF <- if (ncol(object) > MAX_COLS) {
+   head(object[, c(1:MAX_COLS)])
+ } else {
+   head(object)
+ }
+
+# The number of observations will be displayed only if the 
number
+# of rows of the dataset has already been cached.
+if (!is.null(cachedCount)) {
+  cat(paste0("'", class(object), "': ", cachedCount, " obs. of 
",
+length(names), " variables:\n"))
+} else {
+  cat(paste0("'", class(object), "': ", length(names), " 
variables:\n"))
+}
+
+# Whether the ... should be printed at the end of each row
+ellipsis <- FALSE
+
+# Add ellipsis (i.e., "...") if there are more rows than shown
+if (!is.null(cachedCount) && (cachedCount > 6)) {
+  ellipsis <- TRUE
+}
+
+if (nrow(localDF) > 0) {
+  for (i in 1 : ncol(localDF)) {
+firstElements <- ""
--- End diff --

I have fixed this.





[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-12-08 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/9613#discussion_r46995766
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2151,3 +2151,97 @@ setMethod("coltypes",
 
 rTypes
   })
+
+#' Display the structure of a DataFrame, including column names, column
+#' types, as well as a small sample of rows.
+#' @name str
+#' @title Compactly display the structure of a dataset
+#' @rdname str
+#' @family DataFrame functions
+#' @param object a DataFrame
+#' @examples \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Show the structure of the DataFrame
+#' str(irisDF)
+#' }
+setMethod("str",
+  signature(object = "DataFrame"),
+  function(object) {
+
+# TODO: These could be made global parameters, though in R it's not the case
+MAX_CHAR_PER_ROW <- 120
+MAX_COLS <- 100
+
+# Get the column names and types of the DataFrame
+names <- names(object)
+types <- coltypes(object)
+
+# Get the number of rows.
+# TODO: Ideally, this should be cached
+cachedCount <- nrow(object)
+
+# Get the first elements of the dataset. Limit number of columns accordingly
+localDF <- if (ncol(object) > MAX_COLS) {
+   head(object[, c(1:MAX_COLS)])
+ } else {
+   head(object)
+ }
+
+# The number of observations will be displayed only if the number
+# of rows of the dataset has already been cached.
+if (!is.null(cachedCount)) {
+  cat(paste0("'", class(object), "': ", cachedCount, " obs. of ",
+length(names), " variables:\n"))
+} else {
+  cat(paste0("'", class(object), "': ", length(names), " variables:\n"))
+}
+
+# Whether the ... should be printed at the end of each row
+ellipsis <- FALSE
+
+# Add ellipsis (i.e., "...") if there are more rows than shown
+if (!is.null(cachedCount) && (cachedCount > 6)) {
+  ellipsis <- TRUE
+}
+
+if (nrow(localDF) > 0) {
+  for (i in 1 : ncol(localDF)) {
+firstElements <- ""
+
+# Get the first elements for each column
+if (types[i] == "character") {
+  firstElements <- paste(paste0("\"", localDF[,i], "\""), collapse = " ")
+} else {
+  firstElements <- paste(localDF[,i], collapse = " ")
+}
+
+# Add the corresponding number of spaces for alignment
+spaces <- paste(rep(" ", max(nchar(names) - nchar(names[i]))), collapse="")
+
+# Get the short type. For 'character', it would be 'chr';
+# for 'numeric', it's 'num', etc.
--- End diff --

Combining those two lines would result in a 106-character line.
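
For context, here is a minimal sketch of the kind of short-type mapping that
comment block describes, mirroring how base R's str() abbreviates types. The
names SHORT_TYPES and shortType are illustrative only, not SparkR internals:

# Hypothetical lookup table from R type names to str()-style abbreviations
SHORT_TYPES <- c(character = "chr", numeric = "num",
                 integer = "int", logical = "logi")

# Return the abbreviation if one exists; fall back to the full type name
shortType <- function(rType) {
  if (rType %in% names(SHORT_TYPES)) SHORT_TYPES[[rType]] else rType
}

shortType("character")  # "chr"
shortType("map")        # "map" (no abbreviation defined)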





[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-12-07 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/9613#issuecomment-162725183
  
@felixcheung, @sun-rui As I mentioned in my previous comment, it's not just a
matter of replacing data.frame with DataFrame in the header. There are also
issues with the number of rows and with complex data types. For example:

> x <- createDataFrame(sqlContext, list(list(as.environment(list("a"="b", "c"="d", "e"="f")))))

> str(x)
'DataFrame': 1 obs. of 1 variables:
 $ _1: map 

> str(as.data.frame(x))
'data.frame':   1 obs. of  1 variable:
 $ _1:List of 1
  ..$ : 







[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-11-24 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/9613#issuecomment-159366139
  
@shivaram Any further comments, or is clarification required from my end on
the existing ones? Otherwise, should we merge this PR? Thank you!





[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-11-20 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/9613#discussion_r45537232
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2199,3 +2199,97 @@ setMethod("coltypes",
 
 rTypes
   })
+
+#' Display the structure of a DataFrame, including column names, column
+#' types, as well as a small sample of rows.
+#' @name str
+#' @title Compactly display the structure of a dataset
+#' @rdname str
+#' @family DataFrame functions
+#' @param object a DataFrame
+#' @examples \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Show the structure of the DataFrame
+#' str(irisDF)
+#' }
+setMethod("str",
+  signature(object = "DataFrame"),
+  function(object) {
+
+# TODO: These could be made global parameters, though in R it's not the case
+MAX_CHAR_PER_ROW <- 120
+MAX_COLS <- 100
+
+# Get the column names and types of the DataFrame
+names <- names(object)
+types <- coltypes(object)
+
+# Get the number of rows.
+# TODO: Ideally, this should be cached
+cachedCount <- nrow(object)
+
+# Get the first elements of the dataset. Limit number of columns accordingly
+dataFrame <- if (ncol(object) > MAX_COLS) {
+   head(object[, c(1:MAX_COLS)])
+ } else {
+   head(object)
+ }
+
+# The number of observations will be displayed only if the number
+# of rows of the dataset has already been cached.
+if (!is.null(cachedCount)) {
--- End diff --

Yes, that's why I added the TODO. In our implementation, we had a global
cache storing the number of rows of each dataset in the current session. At
some point, we'll need to implement a caching mechanism so that running str()
or nrow() doesn't trigger a full data scan every time. Once such a caching
mechanism is in place, all we'll need to do is replace this line:

cachedCount <- nrow(object)

with

cachedCount <- FUNCTION_TO_GET_CACHED_NROW(object)

The behavior of str() is such that if nrow() hasn't been cached, the number
of rows is simply not shown.
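
As a rough illustration of that idea, here is a minimal sketch of a
session-level row-count cache in plain R. Everything here is hypothetical
(rowCountCache, cacheNrow, and lookupCachedNrow are made-up names, not SparkR
API), and the cache is keyed by a caller-supplied identifier:

# Session-level environment acting as the row-count cache
rowCountCache <- new.env(parent = emptyenv())

# Count once, explicitly, and remember the result (the full scan happens here)
cacheNrow <- function(object, key) {
  assign(key, nrow(object), envir = rowCountCache)
}

# Return the cached count, or NULL if this dataset was never counted;
# str() would print the "obs. of" part only when this is non-NULL
lookupCachedNrow <- function(key) {
  if (exists(key, envir = rowCountCache)) get(key, envir = rowCountCache) else NULL
}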





[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-11-20 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/9613#discussion_r45537282
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2199,3 +2199,97 @@ setMethod("coltypes",
 
 rTypes
   })
+
+#' Display the structure of a DataFrame, including column names, column
+#' types, as well as a small sample of rows.
+#' @name str
+#' @title Compactly display the structure of a dataset
+#' @rdname str
+#' @family DataFrame functions
+#' @param object a DataFrame
+#' @examples \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Show the structure of the DataFrame
+#' str(irisDF)
+#' }
+setMethod("str",
+  signature(object = "DataFrame"),
+  function(object) {
+
+# TODO: These could be made global parameters, though in R it's not the case
+MAX_CHAR_PER_ROW <- 120
+MAX_COLS <- 100
+
+# Get the column names and types of the DataFrame
+names <- names(object)
+types <- coltypes(object)
+
+# Get the number of rows.
+# TODO: Ideally, this should be cached
+cachedCount <- nrow(object)
+
+# Get the first elements of the dataset. Limit number of columns accordingly
+dataFrame <- if (ncol(object) > MAX_COLS) {
--- End diff --

Good idea. Let me change that.
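
For what it's worth, the column-limiting branch above can also be written
without the if/else. A sketch only, not necessarily the change this review
thread settled on:

# Select at most MAX_COLS columns, then take the first rows
dataFrame <- head(object[, seq_len(min(ncol(object), MAX_COLS))])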





[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-11-20 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/9613#discussion_r45537532
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2199,3 +2199,97 @@ setMethod("coltypes",
 
 rTypes
   })
+
+#' Display the structure of a DataFrame, including column names, column
+#' types, as well as a small sample of rows.
+#' @name str
+#' @title Compactly display the structure of a dataset
+#' @rdname str
+#' @family DataFrame functions
+#' @param object a DataFrame
+#' @examples \dontrun{
+#' # Create a DataFrame from the Iris dataset
+#' irisDF <- createDataFrame(sqlContext, iris)
+#' 
+#' # Show the structure of the DataFrame
+#' str(irisDF)
+#' }
+setMethod("str",
+  signature(object = "DataFrame"),
+  function(object) {
+
+# TODO: These could be made global parameters, though in R it's not the case
+MAX_CHAR_PER_ROW <- 120
+MAX_COLS <- 100
+
+# Get the column names and types of the DataFrame
+names <- names(object)
+types <- coltypes(object)
+
+# Get the number of rows.
+# TODO: Ideally, this should be cached
+cachedCount <- nrow(object)
+
+# Get the first elements of the dataset. Limit number of columns accordingly
+dataFrame <- if (ncol(object) > MAX_COLS) {
+   head(object[, c(1:MAX_COLS)])
+ } else {
+   head(object)
+ }
+
+# The number of observations will be displayed only if the number
+# of rows of the dataset has already been cached.
+if (!is.null(cachedCount)) {
+  cat(paste0("'", class(object), "': ", cachedCount, " obs. of ",
+length(names), " variables:\n"))
+} else {
+  cat(paste0("'", class(object), "': ", length(names), " variables:\n"))
+}
+
+# Whether the ... should be printed at the end of each row
+ellipsis <- FALSE
+
+# Add ellipsis (i.e., "...") if there are more rows than shown
+if (!is.null(cachedCount) && (cachedCount > 6)) {
+  ellipsis <- TRUE
+}
+
+if (nrow(dataFrame) > 0) {
--- End diff --

Good point :-). I thought about that before, but I realized there are three
issues:
1) The header is different (DataFrame vs data.frame)
2) The number of rows would not match, and in some cases we don't want to show it
3) We're still not clear on the mapping between DataFrame and data.frame
column types

I added a comment on JIRA SPARK-10863 (see link below). If we implemented
corresponding data types in R, we could leverage parts of utils:::str() in
SparkR:::str().

https://issues.apache.org/jira/browse/SPARK-10863
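
To make issues (1) and (2) concrete: since SparkR's head() collects a small
local data.frame, naively delegating to base R's str() reports the local
sample rather than the distributed dataset. A sketch, assuming the irisDF
example from the docs above:

localSample <- head(irisDF)  # collects at most 6 rows into a local data.frame
str(localSample)
# 'data.frame': 6 obs. of 5 variables:  <- says data.frame, and 6 rows,
#                                          even though irisDF has 150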





[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-11-19 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/9613#issuecomment-158241832
  
Jenkins, could you retest please?





[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-11-19 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/9613#discussion_r45413626
  
--- Diff: R/pkg/R/generics.R ---
@@ -561,6 +579,10 @@ setGeneric("unionAll", function(x, y) { standardGeneric("unionAll") })
 #' @export
 setGeneric("where", function(x, condition) { standardGeneric("where") })
 
+#' @rdname with
--- End diff --

Nice catch. I have fixed it.





[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-11-18 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/9613#issuecomment-157786743
  
@felixcheung @shivaram Any more comments, folks? Otherwise, can we merge 
this? Thank you!





[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...

2015-11-18 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/9613#issuecomment-157903598
  
@shivaram I have updated this branch with master. Thank you!




