[GitHub] [spark] MichaelChirico commented on pull request #28386: [SPARK-26199][SPARK-31517][R] Fix strategy for handling ... names in mutate

2020-11-16 Thread GitBox


MichaelChirico commented on pull request #28386:
URL: https://github.com/apache/spark/pull/28386#issuecomment-728726963


   > Adding magrittr as a dependency
   
   Best saved for a different thread, but TL;DR: yes, I think that's a good 
idea, `magrittr` will help writing `SparkR` code that's more similar to the 
Python API (replace `.` with `%>%`). It's a lightweight dependency too. 
However, do note that R is planning to implement a native pipe in the 
not-so-distant future, see 
[here](https://www.tidyverse.org/blog/2020/08/magrittr-2-0/)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MichaelChirico commented on pull request #28386: [SPARK-26199][SPARK-31517][R] Fix strategy for handling ... names in mutate

2020-11-16 Thread GitBox


MichaelChirico commented on pull request #28386:
URL: https://github.com/apache/spark/pull/28386#issuecomment-728724666


   > Ideally, we'd like both to behave the same way (possibly resolving the 
problem with mutate), that's however a breaking change.
   
   A reasonable request (though definitely out of scope for this PR). My 
pushback would be that I assume `mutate` is designed to behave ~like `dplyr`, 
while the SQL auto-naming is designed to behave ~like SQL (whatever that means, 
the upshot is that I think users would expect consistency vis-a-vis 
Python/other APIs). So some divergence may be inevitable here.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MichaelChirico commented on pull request #28386: [SPARK-26199][SPARK-31517][R] Fix strategy for handling ... names in mutate

2020-11-16 Thread GitBox


MichaelChirico commented on pull request #28386:
URL: https://github.com/apache/spark/pull/28386#issuecomment-728186184


   e.g. in [`SPARK-31517`](https://issues.apache.org/jira/browse/SPARK-31517), 
the supplied width was 98:
   
   ```
   over(rank(), orderBy(windowPartitionBy(column("cyl")), desc(column("mpg")), 
desc(column("disp"
   ```
   
   Not impossible to imagine such expressions ballooning eventually to >500, 
e.g. with (1) using long descriptive column names (2) partitioning by a few 
columns instead of just 1 (3) sorting by a few columns instead of just one (4) 
using different window function with a longer name.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MichaelChirico commented on pull request #28386: [SPARK-26199][SPARK-31517][R] Fix strategy for handling ... names in mutate

2020-11-16 Thread GitBox


MichaelChirico commented on pull request #28386:
URL: https://github.com/apache/spark/pull/28386#issuecomment-728182679


   then what happens when users send an expression with width>500?
   
   not likely but it's safer to just collapse the result
   
   On Mon, Nov 16, 2020, 11:40 AM S Daniel Zafar 
   wrote:
   
   > @MichaelChirico  , what's wrong with
   > just setting width.cutoff = 500L in deparse?
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > , or
   > unsubscribe
   > 

   > .
   >
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MichaelChirico commented on pull request #28386: [SPARK-26199][SPARK-31517][R] Fix strategy for handling ... names in mutate

2020-11-16 Thread GitBox


MichaelChirico commented on pull request #28386:
URL: https://github.com/apache/spark/pull/28386#issuecomment-728044077


   collapse is an aggregation method. input is length>1 & collapse guarantees
   length-1 output.
   
   what's happening is separate has variable return length -- usually it
   returns a length-1 character vector but when the expression is large it has
   length>1.
   
   the bug fixed in this PR arises because of logic written assuming the
   output is always length 1 (see also the referenced Jira)
   
   On Mon, Nov 16, 2020, 4:47 AM Hyukjin Kwon  wrote:
   
   > It makes sense from my look but I would like other reviewers to double
   > check. I speak R as a second language :-).
   > @zero323  and @danzafar
   >  would you mind taking a look when you guys
   > find some time?
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > , or
   > unsubscribe
   > 

   > .
   >
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MichaelChirico commented on pull request #28386: [SPARK-26199][SPARK-31517][R] Fix strategy for handling ... names in mutate

2020-11-15 Thread GitBox


MichaelChirico commented on pull request #28386:
URL: https://github.com/apache/spark/pull/28386#issuecomment-727768929


   @HyukjinKwon test added, please have a look



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MichaelChirico commented on pull request #28386: [SPARK-26199][SPARK-31517][R] Fix strategy for handling ... names in mutate

2020-11-15 Thread GitBox


MichaelChirico commented on pull request #28386:
URL: https://github.com/apache/spark/pull/28386#issuecomment-727694593


   @HyukjinKwon unfortunately I've since switched companies so I'm going from 
memory. But the basic idea AIRI is that `mutate` tries to auto-infer column 
names when not supplied, and this causes issues with wide expressions. An easy 
test would be something like
   
   ```
   1+2+3+...+n
   ```
   
   testing with `n` until the expression gets large enough to cause an error



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MichaelChirico commented on pull request #28386: [SPARK-26199][SPARK-31517][R] fix strategy for handling ... names in mutate

2020-11-10 Thread GitBox


MichaelChirico commented on pull request #28386:
URL: https://github.com/apache/spark/pull/28386#issuecomment-725187311


   @felixcheung I think this PR was ready to merge, could you help to reopen?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MichaelChirico commented on pull request #28386: [SPARK-26199][SPARK-31517][R] fix strategy for handling ... names in mutate

2020-05-03 Thread GitBox


MichaelChirico commented on pull request #28386:
URL: https://github.com/apache/spark/pull/28386#issuecomment-623224648


   Thanks @felixcheung. both backports are directly from `base` R:
   
   
https://github.com/wch/r-source/blob/08ebf253e44e10bfb445f27b53b2a43bc7e6740d/src/library/base/R/utilities.R#L26-L27
   
   
https://github.com/wch/r-source/blob/08ebf253e44e10bfb445f27b53b2a43bc7e6740d/src/library/base/R/strwrap.R#L219-L229
   
   Does that mean we need to use the R license header on the `backports.R` file 
specifically?
   
   As for the file ordering, that's [handled by the `Collate` 
field](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#The-DESCRIPTION-file);
 I've adjusted the order & pushed.
   
   > There was a different way to do method signature compatibility (you should 
be able to find it), maybe it will work better.
   
   I don't follow this, could you ellaborate?
   
   Happy to add tests for `mutate`, could you please help point out where such 
tests should be added? I was a bit confused by the structure of tests.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org