Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17161
  
    @felixcheung, please let me leave my humble opinion.
    
    In general, I like this change. Maybe, it should not necessarily be always 
runnable (at least for now) but I believe it is still an improvement to make 
them runnable, for example, it allows users to directly copy and paste the 
example and easily test.
    To me, this is more like the case that we (at least I) prefer the JIRA 
having a self-contained reproducer. Personally, I just copy and paste the 
example and check the input/output when I first learn and try new APIs as a 
user.
    
    > Firstly, I see this as slightly different from Python, in that in R it is 
common to have built-in datasets and possibly users are used to having them and 
having examples using them.
    
    In case of Python examples, up to my knowledge, it is also common to 
document self-runnable examples (e.g., 
[pandas](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot)
 and [numpy](https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html)). 
Moreover, they are tested in doctests so there is no overhead to check if they 
are runnable everytime. Maybe, let's talk about this in the PR or JIRA (I am 
sorry for dragging Python one into this).
    
    > I have done a pass on the changes in this PR and I'm happy with changing 
from non-existing json file to mtcars. I'm slightly concerned with the few 
cases of artificial 3 rows data (like here) - more on that below on small 
dataset.
    
    Maybe, we could take out the examples that you are concerned of for now..
    
    
    For the four concerns you described, I do agree in general but if it has a 
downside more than what it improves, I would disagree but to me it might sound 
the improvement might be better than the downsides about the change itself. 
    
    >how much work and how much change is it to change all examples (this is 
only 1 .R out of 20-something files we have, in a total of 300+ methods which 
is on the high side for R packages)
    
    I guess the first one is about reviewing cost/backporting/conflict concern. 
I am willing to help verify the examples by manually running even if they are 
so many in terms of reviewing cost if the change is big.
    
    > how much churn will it be to keep them up-to-date when we are having 
changes to API (eg. sparkR.session()); especially since in order to have 
examples self-contained we tend to add additional calls to manipulate data and 
thereby increasing the number of references of API calls; which could get stale 
easily like the example with insertInto
    
    For the second concern, if it makes the maintenance header, it'd be a 
important point but I think we are being conservative on the API changes from 
Spark 2.x. So, I guess we would less likely need to sweep these. I think it is 
okay even if it would be change in the near future if the changes are not quite 
big.
    
    > perhaps more importantly, how practical or useful it would be to use 
built-in datasets or native R data.frame (mtcars, cars, Titanic, iris, or make 
up some; that are super small) on a scalable data platform like Spark? perhaps 
it is better to demonstrate, in examples, how to work with external data 
sources, multiple file formats etc.?
    
    > and lastly, we still have about a dozen methods that are without example 
that are being flagged by CRAN checks (but not enough to fail it yet)
    
    For the third and fourth concerns, I think we could improve this by adding 
more examples, explaining and describing the details. For example, we could 
leave some comments about this such as .. "this is an example but in pratice 
blabla..". Maybe, we could do this in another PR maybe. Otherwise, we could 
leave the examples as they are with some proper comments.
    
    For other thoughts, I think it would be great if they are ran in the tests 
at lesat.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to