The package management thoughts I presented could be considered general suggestions for any R interpreter improvement, with implications beyond just rScala.
Researching options to lock down package management in the R notebook was a suggestion I raised, but I wouldn't consider it a show stopper for getting an R interpreter off the ground as a first step. That said, if we came up with a solid solution for tightening R package management in Zeppelin, there is no reason not to give it a try or discuss its utility. R interpreter functionality and security can mature over time via small iterative improvements and collaboration.

Cheers,
Jeff

On 3/9/16, 10:06 AM, "Amos B. Elberg" <amos.elb...@gmail.com> wrote:

> Eric, denying what happened is just silly. There's a whole record of it.
>
> When you started, your code downloaded the latest rscala. Your code broke entirely when rscala was updated. You scrambled to change it, and at the same time you started bundling the binary so you could control the version of the rscala jar, which you mistakenly thought would prevent it from breaking again. Then you found out (from me) about the R library issue, so you started letting the user pick the rscala version at build time. Which still doesn't fix it.
>
> Now, you say that implementing your solution requires an administrator to lock down the machine running Zeppelin, etc., but to what benefit? There's no way to take advantage of new rscala features anyway.
>
> Regarding pyspark, we have to maintain version parity with Spark. Python also doesn't have R's library management system.
>
> Why are you still defending this? It was a poor design decision. Move on.
>
>> On Mar 9, 2016, at 1:05 AM, Eric Charles <e...@apache.org> wrote:
>>
>>> On 09/03/16 06:41, Amos B. Elberg wrote:
>>> That's not true, Eric. When rscala was updated to 1.0.8, your interpreter broke entirely. You then rushed to fix it, and that's when you began including the binary in the distribution. This is all in the commit logs.
>>
>> I always shipped or downloaded the rscala jar...
>>
>>> I recognize that you now allow the user to select an rscala version at build time, which means they have to compile Zeppelin for a specific rscala version.
>>
>> The user doesn't have to choose. The packager and devops will do it for the user, and the user should not have permission to install or update any package.
>>
>>> What's the point? What you've achieved is to replace 3 short source files by introducing a proven instability, a maintenance burden on the user, and a support burden on us when 200 people show up with obscure error messages that have to be diagnosed.
>>
>> Let me take the analogy with the py4j jar, which fulfils a role similar to the rscala jar: the binder between Scala and another language.
>>
>> With Spark 1.6, the pyspark version was updated. If Zeppelin had shipped the source code, a Zeppelin developer would have had the responsibility to update the complete source code of pyspark.
>>
>> In our case (relying on an external jar), upgrading from 0.8.2.1 to 0.9 was easy:
>>
>> https://github.com/apache/incubator-zeppelin/pull/463/files#diff-dbda0c4083ad9c59ff05f0273b5e760fR320
>>
>> That approach also has the enormous advantage of supporting not only different rscala versions but also different Scala profiles (2.10, 2.11...).
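Both sides of this exchange keep returning to the same constraint: the rscala package installed in R and the rscala jar Zeppelin was built against have to match. Purely as an illustration (neither PR is claimed to do this, and the helper name and expected-version argument below are assumptions), a small check in the interpreter's R bootstrap could surface a mismatch before it turns into an obscure connection error:

    # Illustrative only: warn at interpreter start-up when the rscala R package
    # on the host does not match the rscala version Zeppelin was built against.
    # The expected version string would be injected by whoever pins the build.
    check_rscala_version <- function(expected = "1.0.6") {
      if (!requireNamespace("rscala", quietly = TRUE)) {
        stop("The 'rscala' package is not installed in this R environment.", call. = FALSE)
      }
      installed <- as.character(utils::packageVersion("rscala"))
      if (!identical(installed, expected)) {
        warning(sprintf("rscala R package is %s but Zeppelin expects %s; the R interpreter may fail to connect.",
                        installed, expected), call. = FALSE)
      }
      invisible(installed)
    }

    check_rscala_version("1.0.6")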
>>>> On Mar 9, 2016, at 12:22 AM, Eric Charles <e...@apache.org> wrote:
>>>>
>>>>> On 09/03/16 06:05, Amos B. Elberg wrote:
>>>>> Jeff, you're correct that when Zeppelin is being professionally administered, the administrator can take care of all of this.
>>>>>
>>>>> But why create an additional system administration task? And what about users without professional systems admins?
>>>>>
>>>>> The only "benefit" to doing it that way is that we bundle a binary instead of source, when the source is likely to never need updating. That doesn't seem like a "benefit" at all.
>>>>>
>>>>> And in exchange for that, the cost is things like having to lock down R or prevent package updates? That doesn't make much sense to me.
>>>>>
>>>>> The question of which method has more overhead has been answered empirically: this issue already broke 702, and there have been a whole series of revisions to it to address various issues, with no end in sight.
>>>>
>>>> Amos, you mention a few times that "the issue broke 702". I don't see when and why that particular approach broke anything.
>>>>
>>>> I certainly experimented and reported that the rscala version alignment is important, like any Linux, JDK, Scala, Spark... dependency version.
>>>>
>>>> What I changed, after Moon's comment, is to download the jar at build time instead of shipping the jar in the source tree. This approach has the advantage of letting you define at build time which version you want.
>>>>
>>>>> Meanwhile, this part of 208 has been stable for six months, without a single user issue.
>>>>>
>>>>>> On Mar 8, 2016, at 8:34 PM, Jeff Steinmetz <jeffrey.steinm...@gmail.com> wrote:
>>>>>>
>>>>>> During my tests, I found the rScala installation and its dependencies to be manageable, even though it may not be ideal (i.e. the source is not included). The Zeppelin build already needs to target the correct version of Spark + Hadoop, so for this exercise I treat rScala similarly, as an additional build and install consideration.
>>>>>>
>>>>>> I'm looking at this through a specific lens: "As a technology decision maker in a company, could I and would I deploy this in our environment? Could I work with our Data Scientist team to implement its usage? Would the Data Engineering and Data Science teams that commonly use R find it useful?"
>>>>>>
>>>>>> Inadvertent rScala updates could be minimized with education, letting users know that R package management within a notebook should be avoided. That generally seems like a good idea regardless of how R-Zeppelin is implemented, since it's a shared environment (you don't want to break other users' graphs with an rGraph update, or uninstall ggplot2, etc.).
>>>>>>
>>>>>> Even better: what if there were a Zeppelin config that disabled `install.packages()`, `remove.packages()` and `update.packages()`? This would allow package installation to be carried out only by administrators or devops outside of Zeppelin. The effort vs. benefit isn't clear, though, and I'm sure somebody crafty could find a way around it with a convoluted eval or by running something through the shell in Zeppelin.
>>>>>>
>>>>>> R, Python and Scala all have a pretty wide-open door to parts of the underlying operating system. A 100% bulletproof way of locking "everything" down is a tough challenge.
>>>>>>
>>>>>> ----
>>>>>> Jeff Steinmetz
>>>>>> Principal Architect
>>>>>> Akili Interactive
>>>>>> www.akiliinteractive.com
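As a concrete sketch of the lockdown Jeff describes above (nothing like this exists in either PR; the hook point and the helper name are assumptions), the interpreter's R bootstrap could mask the package-management functions so that a notebook cell calling them gets a clear error instead of silently changing the shared library:

    # Hypothetical sketch, run when the interpreter opens its R session.
    disable_package_management <- function() {
      make_stub <- function(name) {
        force(name)
        function(...) stop(name, "() is disabled in this shared Zeppelin environment; ",
                           "ask an administrator to change installed packages.", call. = FALSE)
      }
      for (fn in c("install.packages", "remove.packages", "update.packages")) {
        assign(fn, make_stub(fn), envir = globalenv())  # mask the utils function for notebook code
        lockBinding(fn, globalenv())                    # block trivial re-assignment from a cell
      }
    }
    disable_package_management()

As Jeff notes, this is not bulletproof: utils::install.packages(), a convoluted eval, or a shell escape from the notebook would still get around it.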
Elberg" <amos.elb...@gmail.com> wrote: >>>>>>> >>>>>>> Jeff - one of the problems with the rscala approach in 702 is it >>>>>>> doesn't take into account the R library. If rscala gets updated, the >>>>>>> user will likely download and update it automatically when they call >>>>>>> update.packages(). The result will be that the version of the rscala R >>>>>>> package doesn't match the version of the rscala jar, and the >>>>>>> interpreter will fail. Or, if the jar is also updated, it will simply >>>>>>> break 702. >>>>>>> >>>>>>> This has happened to 702 already-702 changed its library management >>>>>>> because an update to rscala broke the prior version. Actually though, >>>>>>> every rscala update is going to break 702. >>>>>>> >>>>>>> Regarding return values from the interpreter, the norm in R is that the >>>>>>> return value of the last expression is shown in a native format. So, if >>>>>>> the result is a data frame, a dataframe should be visualized. If the >>>>>>> result is an html object, it should be interpreted as html by the >>>>>>> browser. Do you disagree? >>>>>>> >>>>>>>> On Mar 8, 2016, at 2:52 PM, Jeff Steinmetz >>>>>>>> <jeffrey.steinm...@gmail.com> wrote: >>>>>>>> >>>>>>>> RE 702: >>>>>>>> I wanted to respond to Eric’s discussion from December 30. >>>>>>>> >>>>>>>> I finally had some time to put aside a good chunk of dedicated, >>>>>>>> uninterrupted time. >>>>>>>> This means I had a chance to “really” dig into this with a Data >>>>>>>> Science R developer hat on. >>>>>>>> I also thought about this from a DevOps point of view (deploying in an >>>>>>>> EC2 cluster, standalone, locally, VM). >>>>>>>> I tested it with a spark installation outside of the zeppelin build - >>>>>>>> as if it was running on a cluster or standalone install. >>>>>>>> >>>>>>>> I also had a chance to dig under the hood a bit, and explore what the >>>>>>>> Java/Scala code in PR 702 is doing. >>>>>>>> >>>>>>>> I like the simplicity of this PR (the source code and approach). >>>>>>>> >>>>>>>> Works as expected, all graphic works, interactive charts works. >>>>>>>> >>>>>>>> I also see your point about Rendering the text result vs TABLE plot >>>>>>>> when the R interpreter result is a data frame. >>>>>>>> To confirm - the approach is to use %sql to display it in a native >>>>>>>> Zeppelin visualization. >>>>>>>> >>>>>>>> Your approach makes sense, since this in line with how this works in >>>>>>>> other Zeppelin work flows. >>>>>>>> I suppose you could add an R interpreter function, such as: >>>>>>>> z.R.showDFAsTable(fooDF) if we wanted to force the data frame into a >>>>>>>> %table without having to jump to %sql (perhaps a nice addition in this >>>>>>>> or a future PR). >>>>>>>> >>>>>>>> It’s GREAT that %r print('%html') works with the Zeppelin display >>>>>>>> system! (as well as the other display system methods) >>>>>>>> >>>>>>>> Regarding rscala jar. You have a profile that will allow us to sync >>>>>>>> up the version rscala, so that makes sense as well. >>>>>>>> This too worked as expected. 
>>>>>>>> Regarding the rscala jar: you have a profile that will allow us to sync up the rscala version, so that makes sense as well. This too worked as expected. I specifically installed rscala (as you describe in your docs) in the VM with:
>>>>>>>>
>>>>>>>> curl https://cran.r-project.org/src/contrib/Archive/rscala/rscala_1.0.6.tar.gz -o /tmp/rscala_1.0.6.tar.gz
>>>>>>>> R CMD INSTALL /tmp/rscala_1.0.6.tar.gz
>>>>>>>>
>>>>>>>> Installing rscala outside of the Zeppelin dependencies does seem to keep this PR simpler, and it reduces the licensing overhead required to get this PR through (based on comments I see from others).
>>>>>>>>
>>>>>>>> I would need to add the two rscala install lines above to PR#751 (I will add this today): https://github.com/apache/incubator-zeppelin/pull/751
>>>>>>>>
>>>>>>>> Regarding the interpreters: just having %r as our first interpreter keyword makes sense. Loading knitr within the interpreter to enable rendering (versus having a dedicated %knitr interpreter) seems to keep things simple.
>>>>>>>>
>>>>>>>> In summary: looks good, since everything in your sample R notebook (as well as a few other tests I tried) worked for me using the VM script in PR#751. The documentation also facilitated a smooth installation and allowed me to create a repeatable script that, when paired with the VM, worked as expected.
>>>>>>>>
>>>>>>>> ----
>>>>>>>> Jeff Steinmetz
>>>>>>>> Principal Architect
>>>>>>>> Akili Interactive
>>>>>>>> www.akiliinteractive.com
>>>>>>>>
>>>>>>>>> From: Eric Charles <e...@apache.org>
>>>>>>>>> Subject: [DISCUSS] PR #208 - R Interpreter for Zeppelin
>>>>>>>>> Date: Wed, 30 Dec 2015 14:04:33 GMT
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I had a look at https://github.com/apache/incubator-zeppelin/pull/208 (and the related GitHub repo https://github.com/elbamos/Zeppelin-With-R [1]).
>>>>>>>>>
>>>>>>>>> Here are a few topics for discussion based on my experience developing https://github.com/datalayer/zeppelin-R [2].
>>>>>>>>>
>>>>>>>>> 1. rscala jar not in Maven Repository
>>>>>>>>>
>>>>>>>>> [1] copies the source (Scala and R) code from the rscala repo and changes/extends/repackages it a bit. [2] declares the jar as a system-scoped library. I recently had incompatibility issues between 1.0.8 (the version you get since 2015-12-10 when you install rscala in your R environment) and the 1.0.6 jar I am using as part of the zeppelin-R build. To avoid such issues, why not let the user choose the version via a property at build time, to fit the version running on the host? This would also let us benefit from future rscala releases, which fix bugs, bring new features... It also means we don't have to copy the rscala code into the Zeppelin tree.
>>>>>>>>>
>>>>>>>>> 2. Interpreters
>>>>>>>>>
>>>>>>>>> [1] proposes two interpreters, %sparkr.r and %sparkr.knitr, implemented in their own module apart from the Spark one. To align with the existing pyspark implementation, why not integrate the R code into the Spark module? Is there any reason to keep two versions which do basically the same thing? The single magic keyword would then be %spark.r.
>>>>>>>>>
>>>>>>>>> 3. Rendering a TABLE plot when the interpreter result is a data frame
>>>>>>>>>
>>>>>>>>> This may be confusing. What if I display a plot and simply want to print the first 10 rows at the end of my code? To keep the same behavior as the other interpreters, we could make this feature optional (disabled by default, enabled via a property).
>>>>>>>>>
>>>>>>>>> Thx, Eric