Re: R Interpreter - PR 702

enzo Sun, 13 Mar 2016 14:28:05 -0700

Just a side consideration on package management, expanding some of Jeff’s 
comments.


As we all know, R is in essence a “single user” tool.  I assume that, at the 
beginning, Zeppelin will be the same.

On my machine I have two libraries where my R packages are stored, a system 
library and a user library.

I imagine Zeppelin will “discover” my personal user library and of course the 
system library.  As such there is no need for anything else.  Any package 
necessary for the R interpreter(s) could be stored in the user’s library.  

This would work, but in principle a user could still download rscala or similar 
packages to use, for example, with RStudio or in a parallel instance of Jupyter 
(using IRkernel). Hence the way Zeppelin manages these packages will have to be 
error-proof (maybe by managing a Zeppelin library that is not accessible / 
updatable by RStudio or the R GUI on the system?  I am not sure what the design 
of choice is in PR 702 or PR 208).
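
To make the "Zeppelin library" idea concrete, here is a minimal sketch of the lookup order it implies. It is written in Python purely as a language-neutral model, and it is not taken from PR 702 or PR 208. The library names, the package versions and the `resolve` helper are all hypothetical. The only R behaviour assumed is that libraries are searched in order and the first one containing the package wins.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical library contents: library name -> {package: installed version}.
LIBRARIES: Dict[str, Dict[str, str]] = {
    "zeppelin": {"rscala": "1.0.6"},                  # pinned by the Zeppelin build
    "user": {"rscala": "1.0.8", "ggplot2": "2.1.0"},  # updated freely by the user
    "system": {"stats": "3.2.3"},
}

# Hypothetical search order: the Zeppelin-private library comes first.
SEARCH_ORDER: List[str] = ["zeppelin", "user", "system"]


def resolve(
    package: str,
    search_order: Optional[List[str]] = None,
    libraries: Optional[Dict[str, Dict[str, str]]] = None,
) -> Optional[Tuple[str, str]]:
    """Return (library, version) for the first library that holds the package."""
    order = SEARCH_ORDER if search_order is None else search_order
    libs = LIBRARIES if libraries is None else libraries
    for lib in order:
        version = libs.get(lib, {}).get(package)
        if version is not None:
            return lib, version
    return None  # package is not installed in any searched library
```

With the private library first, `resolve("rscala")` picks the pinned copy even though the user library holds a newer one; without it (`search_order=["user", "system"]`) the user's copy is what the interpreter would see.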

While maybe premature, I think we also need to discuss what is going to happen 
in the case of a Zeppelin Server serving many users. This goes a bit beyond the 
normal operating patterns for R.  I don’t think that in such a case we should 
expect to have an R administrator managing packages outside of Zeppelin (does 
Zeppelin plan to have an admin interface in such a case?). 

In the case of a Zeppelin Server, different scenarios could apply:

- One possibility would be to have a single library dedicated to the server, 
  where all users share all packages. The issues would be how to maintain it, 
  and what happens if different users require different versions of some 
  package. While it may appear cumbersome, it may be appropriate in 
  environments that plan to control rigidly which version of a package is 
  applied when.
- Possibly the most functionally complete approach would be to have a library 
  per user (plus probably a “private” library for Zeppelin, or dedicated 
  packages copied into each library, as currently done by RStudio with 
  rstudioapi).
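
The two scenarios above differ only in which library directories a given user's interpreter would search. A rough sketch of that difference follows, again in Python as a neutral model. The directory layout, the root path, the mode names and the `library_paths` helper are all assumptions made for illustration, not anything defined by Zeppelin.

```python
from pathlib import PurePosixPath
from typing import List


def library_paths(user: str, mode: str, root: str = "/opt/zeppelin/r-libs") -> List[str]:
    """Return the ordered library directories for one user's interpreter.

    mode "shared":   one server-wide library used by every user.
    mode "per-user": one library per user.
    In both modes a Zeppelin-private library (hypothetical) is searched first.
    """
    base = PurePosixPath(root)
    private = base / "zeppelin"
    if mode == "shared":
        return [str(private), str(base / "shared")]
    if mode == "per-user":
        return [str(private), str(base / "users" / user)]
    raise ValueError(f"unknown mode: {mode!r}")
```

Keeping the choice behind a single function like this is one way to leave room for switching between the two approaches later.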

What are the plans  / ideas on this for PR 208 / 702?

It would be interesting to have the reassurance that, whichever design is 
chosen, there will be flexibility to implement different approaches in the 
future...



Enzo
e...@smartinsightsfromdata.com



> On 9 Mar 2016, at 18:56, Jeff Steinmetz <jeffrey.steinm...@gmail.com> wrote:
> 
> The package management thoughts I presented could be considered general 
> suggestions for any R interpreter improvement, with implications beyond just 
> rScala.
> 
> Researching options to lock down package management in the R notebook was a 
> suggestion I raised, which I wouldn’t consider to be a show stopper for 
> getting an R interpreter off the ground as a first step.  
> Although, if we came up with an awesome solution to help buckle down R 
> package management in Zeppelin, there is no reason not to give it a try or 
> discuss its utility.  
> R Interpreter functionality and security can mature over time via small 
> iterative improvements and collaboration.
> 
> 
> 
> 
> Cheers, Jeff
> 
> 
> On 3/9/16, 10:06 AM, "Amos B. Elberg" <amos.elb...@gmail.com> wrote:
> 
>> Eric, denying what happened is just silly. There's a whole record of it.
>> 
>> When you started, your code downloaded the latest rscala. Your code broke 
>> entirely when rscala was updated. You scrambled to change it, and at the 
>> same time you started bundling the binary so you could control the version 
>> of the rscala jar, which you mistakenly thought would prevent it from 
>> breaking again. Then you found out (from me) about the R library issue, so 
>> you started letting the user pick the rscala version at build time. Which 
>> still doesn't fix it.
>> 
>> Now, you say that implementing your solution requires an administrator to 
>> lock-down the machine running Zeppelin, etc etc etc --- but to what benefit? 
>>  There's no way to take advantage of new rscala features anyway. 
>> 
>> Regarding pyspark, we have to maintain version parity with spark. Python 
>> also doesn't have r's library management system. 
>> 
>> Why are you still defending this? It was a poor design decision. Move on.
>> 
>>> On Mar 9, 2016, at 1:05 AM, Eric Charles <e...@apache.org> wrote:
>>> 
>>> 
>>>> On 09/03/16 06:41, Amos B. Elberg wrote:
>>>> That's not true eric. When rscala was updated to 1.0.8, your interpreter 
>>>> broke entirely. You then rushed to fix it, and that's when you began 
>>>> including the binary in the distribution. This is all in the commit logs.
>>> 
>>> I always shipped or downloaded the rscala jar...
>>> 
>>>> I recognize that you now allow the user to select an rscala version at 
>>>> build time, which means they have to compile Zeppelin for a specific 
>>>> rscala version.
>>> 
>>> The user does not have to choose. The packager and devops will do it for 
>>> the user, and the user should not have permission to install/update any 
>>> package.
>>> 
>>>> What's the point? What you've achieved is to replace 3 short source files, 
>>>> by introducing a proven instability, a maintenance burden on the user, and 
>>>> a support burden on us when 200 people show up with obscure error messages 
>>>> that have to be diagnosed.
>>> 
>>> Let me take the analogy with the py4j jar which fulfills a similar role as 
>>> rscala jar, the role of binder between scala and another language.
>>> 
>>> With spark 1.6, the pyspark version has been updated. If Zeppelin had 
>>> shipped the source code, zeppelin developer would have had the 
>>> responsibility to update the complete source code of pyspark.
>>> 
>>> In our case (relying on external jar), upgrading from 0.8.2.1 to 0.9 was 
>>> easy:
>>> 
>>> https://github.com/apache/incubator-zeppelin/pull/463/files#diff-dbda0c4083ad9c59ff05f0273b5e760fR320
>>> 
>>> That approach also has the enormous advantage of supporting not only 
>>> different rscala versions but also different scala profiles (2.10, 2.11...).
>>> 
>>>>> On Mar 9, 2016, at 12:22 AM, Eric Charles <e...@apache.org> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 09/03/16 06:05, Amos B. Elberg wrote:
>>>>>> Jeff you're correct that when Zeppelin is being professionally  
>>>>>> administered, the administrator can take care of all of this.
>>>>>> 
>>>>>> But why create an additional system administration task? And what about 
>>>>>> users without professional systems admins?
>>>>>> 
>>>>>> The only "benefit" to doing it that way, is we bundle a binary instead 
>>>>>> of source, when the source is likely to never need updating. That 
>>>>>> doesn't seem like a "benefit" at all.
>>>>>> 
>>>>>> And in exchange for that, the cost is things like having to lock down R 
>>>>>> or prevent package updates?  That doesn't make much sense to me.
>>>>>> 
>>>>>> The question of which method has more overhead has been answered 
>>>>>> empirically: this issue already broke 702, and there have been a whole 
>>>>>> series of revisions to it to address various issues, with no end in 
>>>>>> sight.
>>>>> 
>>>>> Amos, You mention a few times that 'the issue broke 702...'. I don't see 
>>>>> when and why that particular approach broke anything.
>>>>> 
>>>>> I certainly experimented and reported that the rscala version alignment 
>>>>> is important, like any version of linux, jdk, scala, spark... dependency.
>>>>> 
>>>>> What I changed, after Moon's comment, is the download of the jar at build 
>>>>> time instead of shipping the jar in the source tree. This approach has 
>>>>> the advantage of being able to define at build time which version you 
>>>>> want.
>>>>> 
>>>>>> Meanwhile, this part of 208 has been stable for six months, without a 
>>>>>> single user issue.
>>>>>> 
>>>>>> 
>>>>>>> On Mar 8, 2016, at 8:34 PM, Jeff Steinmetz 
>>>>>>> <jeffrey.steinm...@gmail.com> wrote:
>>>>>>> 
>>>>>>> During my tests, I found the rScala installation and its dependencies 
>>>>>>> to be manageable - even though it may not be ideal (i.e. the source is 
>>>>>>> not included).
>>>>>>> The Zeppelin build already needs to target the correct version of Spark 
>>>>>>> + Hadoop so for this exercise, I treat rScala similarly, as an 
>>>>>>> additional build and install consideration.
>>>>>>> 
>>>>>>> I’m looking at this through a specific lens:
>>>>>>> “As a technology decision maker in a company, could I and would I 
>>>>>>> deploy this in our environment? Could I work with our Data Scientist 
>>>>>>> team to implement its usage?  Would the Data Engineering and Data 
>>>>>>> Science team that commonly use R find it useful?”
>>>>>>> 
>>>>>>> Inadvertent rScala updates could be minimized with education, letting 
>>>>>>> the users know that R package management within a notebook should be 
>>>>>>> avoided.  That generally seems like a good idea regardless of how 
>>>>>>> R-Zeppelin is implemented, since it’s a shared environment (you don’t 
>>>>>>> want to break other users’ graphs with an rGraph update or by 
>>>>>>> uninstalling ggplot2, etc.)
>>>>>>> 
>>>>>>> Even better - what if there was a zeppelin config that disabled 
>>>>>>> `install.packages()`, `remove.packages()` and `update.packages()`.  
>>>>>>> This would allow package installation to be carried out only by 
>>>>>>> administrators or devops outside of Zeppelin.
>>>>>>> Although the effort vs. benefit is not clear: I’m sure somebody 
>>>>>>> crafty could come up with a way around this with a convoluted Eval or 
>>>>>>> by running something through the shell in Zeppelin.
>>>>>>> 
>>>>>>> R, Python and Scala all have a pretty wide open door to parts of the 
>>>>>>> underlying operating system.
>>>>>>> A 100% bulletproof way of locking “everything” down is a tough 
>>>>>>> challenge.
>>>>>>> 
>>>>>>> ----
>>>>>>> Jeff Steinmetz
>>>>>>> Principal Architect
>>>>>>> Akili Interactive
>>>>>>> www.akiliinteractive.com <http://www.akiliinteractive.com/>
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 3/8/16, 12:16 PM, "Amos B. Elberg" <amos.elb...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Jeff - one of the problems with the rscala approach in 702 is it 
>>>>>>>> doesn't take into account the R library. If rscala gets updated, the 
>>>>>>>> user will likely download and update it automatically when they call 
>>>>>>>> update.packages(). The result will be that the version of the rscala R 
>>>>>>>> package doesn't match the version of the rscala jar, and the 
>>>>>>>> interpreter will fail. Or, if the jar is also updated, it will simply 
>>>>>>>> break 702.
>>>>>>>> 
>>>>>>>> This has happened to 702 already: 702 changed its library management 
>>>>>>>> because an update to rscala broke the prior version. Actually though, 
>>>>>>>> every rscala update is going to break 702.
>>>>>>>> 
>>>>>>>> Regarding return values from the interpreter, the norm in R is that 
>>>>>>>> the return value of the last expression is shown in a native format. 
>>>>>>>> So, if the result is a data frame, a dataframe should be visualized. 
>>>>>>>> If the result is an html object, it should be interpreted as html by 
>>>>>>>> the browser.  Do you disagree?
>>>>>>>> 
>>>>>>>>> On Mar 8, 2016, at 2:52 PM, Jeff Steinmetz 
>>>>>>>>> <jeffrey.steinm...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> RE 702:
>>>>>>>>> I wanted to respond to Eric’s discussion from December 30.
>>>>>>>>> 
>>>>>>>>> I finally managed to put aside a good chunk of dedicated, 
>>>>>>>>> uninterrupted time.
>>>>>>>>> This means I had a chance to “really” dig into this with a Data 
>>>>>>>>> Science R developer hat on.
>>>>>>>>> I also thought about this from a DevOps point of view (deploying in 
>>>>>>>>> an EC2 cluster, standalone, locally, VM).
>>>>>>>>> I tested it with a spark installation outside of the zeppelin build - 
>>>>>>>>> as if it was running on a cluster or standalone install.
>>>>>>>>> 
>>>>>>>>> I also had a chance to dig under the hood a bit, and explore what the 
>>>>>>>>> Java/Scala code in PR 702 is doing.
>>>>>>>>> 
>>>>>>>>> I like the simplicity of this PR (the source code and approach).
>>>>>>>>> 
>>>>>>>>> It works as expected: all graphics work, and interactive charts work.
>>>>>>>>> 
>>>>>>>>> I also see your point about Rendering the text result vs TABLE plot 
>>>>>>>>> when the R interpreter result is a data frame.
>>>>>>>>> To confirm - the approach is to use  %sql to display it in a native 
>>>>>>>>> Zeppelin visualization.
>>>>>>>>> 
>>>>>>>>> Your approach makes sense, since this is in line with how this works 
>>>>>>>>> in other Zeppelin workflows.
>>>>>>>>> I suppose you could add an R interpreter function, such as: 
>>>>>>>>> z.R.showDFAsTable(fooDF) if we wanted to force the data frame into a 
>>>>>>>>> %table without having to jump to %sql (perhaps a nice addition in 
>>>>>>>>> this or a future PR).
>>>>>>>>> 
>>>>>>>>> It’s GREAT that %r print('%html') works with the Zeppelin display 
>>>>>>>>> system!  (as well as the other display system methods)
>>>>>>>>> 
>>>>>>>>> Regarding the rscala jar.  You have a profile that will allow us to 
>>>>>>>>> sync up the version of rscala, so that makes sense as well.
>>>>>>>>> This too worked as expected.  I specifically installed rscala (as you 
>>>>>>>>> describe in your docs) in the VM with:
>>>>>>>>> 
>>>>>>>>> curl https://cran.r-project.org/src/contrib/Archive/rscala/rscala_1.0.6.tar.gz \
>>>>>>>>>   -o /tmp/rscala_1.0.6.tar.gz
>>>>>>>>> R CMD INSTALL /tmp/rscala_1.0.6.tar.gz
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Installing rscala outside of the Zeppelin dependencies does seem to 
>>>>>>>>> keep this PR simpler, and reduces the licensing overhead required to 
>>>>>>>>> get this PR through (based on comments I see from others)
>>>>>>>>> 
>>>>>>>>> I would need to add the two rscala install lines above to PR#751 (I 
>>>>>>>>> will add this today)
>>>>>>>>> https://github.com/apache/incubator-zeppelin/pull/751
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Regarding the Interpreters.   Just having %r as our first 
>>>>>>>>> interpreter keyword makes sense.   Loading knitr within the 
>>>>>>>>> interpreter to enable rendering (versus having a %knitr interpreter 
>>>>>>>>> specifically) seems to keep things simple.
>>>>>>>>> 
>>>>>>>>> In summary - Looks good since everything in your sample R notebook 
>>>>>>>>> (as well as a few other tests I tried) worked for me using the VM 
>>>>>>>>> script in PR#751.
>>>>>>>>> The documentation also facilitated a smooth installation and allowed 
>>>>>>>>> me to create a repeatable script, that when paired with the VM worked 
>>>>>>>>> as expected.
>>>>>>>>> 
>>>>>>>>> ----
>>>>>>>>> Jeff Steinmetz
>>>>>>>>> Principal Architect
>>>>>>>>> Akili Interactive
>>>>>>>>> www.akiliinteractive.com <http://www.akiliinteractive.com/>
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> From: Eric Charles <e...@apache.org>
>>>>>>>>>> Subject: [DISCUSS] PR #208 - R Interpreter for Zeppelin
>>>>>>>>>> Date: Wed, 30 Dec 2015 14:04:33 GMT
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I had a look at https://github.com/apache/incubator-zeppelin/pull/208
>>>>>>>>>> (and related Github repo https://github.com/elbamos/Zeppelin-With-R 
>>>>>>>>>> [1])
>>>>>>>>>> 
>>>>>>>>>> Here are a few topics for discussion based on my experience 
>>>>>>>>>> developing
>>>>>>>>>> https://github.com/datalayer/zeppelin-R [2].
>>>>>>>>>> 
>>>>>>>>>> 1. rscala jar not in Maven Repository
>>>>>>>>>> 
>>>>>>>>>> [1] copies the source (scala and R) code from the rscala repo and
>>>>>>>>>> changes/extends/repackages it a bit. [2] declares the jar as a system
>>>>>>>>>> scoped library. I recently had incompatibility issues between the
>>>>>>>>>> 1.0.8 (the one you get since 2015-12-10 when you install rscala on
>>>>>>>>>> your R environment) and the 1.0.6 jar I am using as part of the
>>>>>>>>>> zeppelin-R build. To avoid such issues, why not let the user choose
>>>>>>>>>> the version via a property at build time, to fit the version they
>>>>>>>>>> run on their host? This would also allow us to benefit from the next
>>>>>>>>>> rscala releases, which fix bugs, bring new features... This also
>>>>>>>>>> means we don't have to copy the rscala code into the Zeppelin tree.
>>>>>>>>>> 
>>>>>>>>>> 2. Interpreters
>>>>>>>>>> 
>>>>>>>>>> [1] proposes 2 interpreters, %sparkr.r and %sparkr.knitr, which are
>>>>>>>>>> implemented in their own module apart from the Spark one. To be
>>>>>>>>>> aligned with the existing pyspark implementation, why not integrate
>>>>>>>>>> the R code into the Spark one? Any reason to keep 2 versions which
>>>>>>>>>> do basically the same thing? The unique magic keyword would then be
>>>>>>>>>> %spark.r
>>>>>>>>>> 
>>>>>>>>>> 3. Rendering TABLE plot when interpreter result is a dataframe
>>>>>>>>>> 
>>>>>>>>>> This may be confusing. What if I display a plot and simply want to
>>>>>>>>>> print the first 10 rows at the end of my code? To keep the same
>>>>>>>>>> behavior as the other interpreters, we could make this feature
>>>>>>>>>> optional (disabled by default, enabled via property).
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thx, Eric
>>>>>>> 
> 
>
