[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13684


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-16 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13684#discussion_r67412795
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1949,14 +1950,24 @@ setMethod("where",
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
 #' dropDuplicates(df)
+#' dropDuplicates(df, "col1", "col2")
 #' dropDuplicates(df, c("col1", "col2"))
 #' }
 setMethod("dropDuplicates",
   signature(x = "SparkDataFrame"),
-  function(x, colNames = columns(x)) {
-stopifnot(class(colNames) == "character")
-
-sdf <- callJMethod(x@sdf, "dropDuplicates", as.list(colNames))
+  function(x, ...) {
+cols <- list(...)
+if (length(cols) == 0) {
+  sdf <- callJMethod(x@sdf, "dropDuplicates", 
as.list(columns(x)))
+} else {
+  if (!all(sapply(cols, function(c) { is.character(c) }))) { 
stop }
--- End diff --

It's done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-16 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13684#discussion_r67412381
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1949,14 +1950,24 @@ setMethod("where",
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
 #' dropDuplicates(df)
+#' dropDuplicates(df, "col1", "col2")
 #' dropDuplicates(df, c("col1", "col2"))
 #' }
 setMethod("dropDuplicates",
   signature(x = "SparkDataFrame"),
-  function(x, colNames = columns(x)) {
-stopifnot(class(colNames) == "character")
-
-sdf <- callJMethod(x@sdf, "dropDuplicates", as.list(colNames))
+  function(x, ...) {
+cols <- list(...)
+if (length(cols) == 0) {
+  sdf <- callJMethod(x@sdf, "dropDuplicates", 
as.list(columns(x)))
+} else {
+  if (!all(sapply(cols, function(c) { is.character(c) }))) { 
stop }
--- End diff --

Oh, sure!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-16 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/13684#discussion_r67412216
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1949,14 +1950,24 @@ setMethod("where",
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
 #' dropDuplicates(df)
+#' dropDuplicates(df, "col1", "col2")
 #' dropDuplicates(df, c("col1", "col2"))
 #' }
 setMethod("dropDuplicates",
   signature(x = "SparkDataFrame"),
-  function(x, colNames = columns(x)) {
-stopifnot(class(colNames) == "character")
-
-sdf <- callJMethod(x@sdf, "dropDuplicates", as.list(colNames))
+  function(x, ...) {
+cols <- list(...)
+if (length(cols) == 0) {
+  sdf <- callJMethod(x@sdf, "dropDuplicates", 
as.list(columns(x)))
+} else {
+  if (!all(sapply(cols, function(c) { is.character(c) }))) { 
stop }
--- End diff --

Can we print a error message in `stop` here ? Something like `stop("all 
columns names should be characters")`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-15 Thread sun-rui
Github user sun-rui commented on a diff in the pull request:

https://github.com/apache/spark/pull/13684#discussion_r67287926
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1856,10 +1856,11 @@ setMethod("where",
 #' the subset of columns.
 #'
 #' @param x A SparkDataFrame.
-#' @param colnames A character vector of column names.
+#' @param ... A character vector of column names or string column names.
+#' If the first argument contains a character vector then the following 
column names are ignored.
--- End diff --

align this line to "A character ..." of the above line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13684#discussion_r67280096
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1869,14 +1871,22 @@ setMethod("where",
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
 #' dropDuplicates(df)
+#' dropDuplicates(df, "col1", "col2")
 #' dropDuplicates(df, c("col1", "col2"))
 #' }
 setMethod("dropDuplicates",
   signature(x = "SparkDataFrame"),
-  function(x, colNames = columns(x)) {
-stopifnot(class(colNames) == "character")
-
-sdf <- callJMethod(x@sdf, "dropDuplicates", as.list(colNames))
+  function(x, col, ...) {
--- End diff --

Thank you for review, @sun-rui !
I'll improve like that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-15 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/13684#discussion_r67234728
  
--- Diff: R/pkg/R/generics.R ---
@@ -462,12 +462,9 @@ setGeneric("describe", function(x, col, ...) { 
standardGeneric("describe") })
 #' @export
 setGeneric("drop", function(x, ...) { standardGeneric("drop") })
 
-#' @rdname dropduplicates
+#' @rdname dropDuplicates
 #' @export
-setGeneric("dropDuplicates",
-   function(x, colNames = columns(x)) {
- standardGeneric("dropDuplicates")
-   })
+setGeneric("dropDuplicates", function(x, col, ...) { 
standardGeneric("dropDuplicates") })
--- End diff --

hmm - putting `col` in the generic means that it will always be required ? 
I'm not sure how the code path with no arguments will work then. I think we can 
leave the generic as `dropDuplicates(x, ...)` and then have a more specific 
implementation. (See `na.omit` for example)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-15 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/13684#discussion_r67233674
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1869,14 +1871,23 @@ setMethod("where",
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
 #' dropDuplicates(df)
+#' dropDuplicates(df, "col1", "col2")
 #' dropDuplicates(df, c("col1", "col2"))
 #' }
 setMethod("dropDuplicates",
   signature(x = "SparkDataFrame"),
-  function(x, colNames = columns(x)) {
-stopifnot(class(colNames) == "character")
-
-sdf <- callJMethod(x@sdf, "dropDuplicates", as.list(colNames))
+  function(x, col, ...) {
+if (nargs() == 1) {
+  sdf <- callJMethod(x@sdf, "dropDuplicates", 
as.list(columns(x)))
+}
+else {
--- End diff --

style nit: the else goes on the previous line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13684#discussion_r67213090
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1869,6 +1869,7 @@ setMethod("where",
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
 #' dropDuplicates(df)
+#' dropDuplicates(df, "col1", "col2")
 #' dropDuplicates(df, c("col1", "col2"))
 #' }
 setMethod("dropDuplicates",
--- End diff --

I'll try to merge them into one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13684#discussion_r67210755
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1869,6 +1869,7 @@ setMethod("where",
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
 #' dropDuplicates(df)
+#' dropDuplicates(df, "col1", "col2")
 #' dropDuplicates(df, c("col1", "col2"))
 #' }
 setMethod("dropDuplicates",
--- End diff --

Actually, I kept the existing `dropDuplicates` since it handles 
`dropDuplicates(df)` for all columns, too.
Don't we still need two functions if we move the case "c(...)"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13684#discussion_r67209107
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1859,7 +1859,7 @@ setMethod("where",
 #' @param colnames A character vector of column names.
--- End diff --

Yep. Right. I will add that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-15 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/13684#discussion_r67208482
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1859,7 +1859,7 @@ setMethod("where",
 #' @param colnames A character vector of column names.
--- End diff --

That sounds good. Would also be good to include `if col contains a 
character vector then additional column names are ignored` (since thats what 
the implementation is doing from what I see ?)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-15 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/13684#discussion_r67208357
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1869,6 +1869,7 @@ setMethod("where",
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
 #' dropDuplicates(df)
+#' dropDuplicates(df, "col1", "col2")
 #' dropDuplicates(df, c("col1", "col2"))
 #' }
 setMethod("dropDuplicates",
--- End diff --

Since we have unified the generics, can we also fold this method into the 
other one ? i.e. move the code here to the `length(col) > 1` case in the other 
function 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13684#discussion_r67206540
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1859,7 +1859,7 @@ setMethod("where",
 #' @param colnames A character vector of column names.
--- End diff --

Oh, thank you for review, @shivaram .
Sure. I'll update the doc. Maybe, something like the following?
```
- #' @param colnames A character vector of column names.
+ #' @param col A character vector of column names or a string of column 
name
+ #' @param ... Additional column names
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-15 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/13684#discussion_r67205381
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1859,7 +1859,7 @@ setMethod("where",
 #' @param colnames A character vector of column names.
--- End diff --

can we update this to indicate that we accept the varargs style ? I think 
we can document `...` as a `@param` for this


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13684: [SPARK-15908][R] Add varargs-type dropDuplicates(...

2016-06-15 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/13684

[SPARK-15908][R] Add varargs-type dropDuplicates() function in SparkR

## What changes were proposed in this pull request?

This PR adds varargs-type `dropDuplicates` function to SparkR for API 
parity. 
Refer to https://issues.apache.org/jira/browse/SPARK-15807, too.

## How was this patch tested?

Pass the Jenkins tests with new testcases.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-15908

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13684.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13684


commit f1d6355af9dc8e782680a1fc3fac07f8ca31b82b
Author: Dongjoon Hyun 
Date:   2016-06-15T10:08:28Z

[SPARK-15908][R] Add varargs-type dropDuplicates() function in SparkR




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org