@enzo - 208 does not need an R administrator in server or single-user mode. This is because the R package that 208 uses is segregated -- the same approach is used by other server-based R systems like RStudio and Shiny.
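A minimal sketch of what a segregated package library looks like in practice (the directory layout and the package name below are illustrative, not the literal code in 208; only the mechanism matters):

    # Sketch only: a private, interpreter-owned package library.
    zeppelin_lib <- file.path(Sys.getenv("ZEPPELIN_HOME", "."), "R", "lib")
    dir.create(zeppelin_lib, recursive = TRUE, showWarnings = FALSE)

    # Put the private library first on the search path, ahead of the user
    # and system libraries, so the interpreter's own R-side dependencies
    # are always resolved from there.
    .libPaths(c(zeppelin_lib, .libPaths()))

    # Whatever the interpreter needs is installed only into that private
    # library; the user's personal and system libraries are never touched.
    install.packages("knitr", lib = zeppelin_lib,
                     repos = "https://cloud.r-project.org")

Because the interpreter's copies live outside the user and system libraries, a user running install.packages() or update.packages() in their own R session (RStudio, the R GUI, Jupyter) cannot pull the interpreter's dependencies out from under it.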
With 702, the problem is unsolvable except by locking down the system's R installation entirely. Again, nothing is achieved by the "design choice" in 702. I have been asking for a long time, and no one has even tried to point to a benefit. I can't see any reason we're still talking about 702 (in fact, any reason 702 exists at all) other than the games that some people have been playing with 208 for the past six months. It's time to stop this nonsense.

> On Mar 13, 2016, at 5:27 PM, enzo <e...@smartinsightsfromdata.com> wrote:
>
> Just a side consideration on package management, expanding some of Jeff's comments.
>
> As we all know, R is in essence a "single user" tool. I assume that at the beginning Zeppelin will be the same.
>
> On my machine I have two libraries where my R packages are stored: a system library and a user library.
>
> I imagine Zeppelin will "discover" my personal user library and of course the system library. As such, there is no need for anything else. Any package necessary for the R interpreter(s) could be stored in the user's library.
>
> This would work, but in principle a user could still download rscala or similar packages to use, for example, with RStudio or in a parallel instance of Jupyter (using IRkernel). Hence the way Zeppelin manages these will have to be error-proof (maybe by managing a Zeppelin library that is not accessible/updatable by RStudio or the R GUI on the system? I am not sure what the chosen design of PR 702 or PR 208 is).
>
> While maybe premature, I think we also need to discuss what is going to happen in the case of a Zeppelin server serving many users. This stretches the normal operating patterns for R a bit. I don't think in such a case we should expect an R administrator to manage packages outside of Zeppelin (does Zeppelin plan to have an admin interface for that case?).
>
> In the case of a Zeppelin server, different scenarios could apply:
> One possibility would be to have a single library dedicated to the server, where all users share all packages. The issue would be how to maintain it, and what happens if different users require different versions of some package? While it may appear cumbersome, it may be appropriate in environments that plan to rigidly control which version of a package is used, and when.
> Possibly the most functionally complete approach would be to have a library per user (plus probably a "private" library for Zeppelin - or dedicated packages copied into each library, as currently done by RStudio with rstudioapi).
>
> What are the plans / ideas on this for PR 208 / 702?
>
> It would be reassuring to know that, whichever design is chosen, there will be flexibility to implement different approaches in the future...
>
> Enzo
> e...@smartinsightsfromdata.com
>
>> On 9 Mar 2016, at 18:56, Jeff Steinmetz <jeffrey.steinm...@gmail.com> wrote:
>>
>> The package management thoughts I presented could be considered general suggestions for any R interpreter improvement, with implications beyond just rScala.
>>
>> Researching options to lock down package management in the R notebook was a suggestion I raised, which I wouldn't consider a show stopper for getting an R interpreter off the ground as a first step.
>> That said, if we came up with an awesome solution to help lock down R package management in Zeppelin, there is no reason not to give it a try or discuss its utility.
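On the multi-user server scenario above: a per-user library does not require an R administrator either. A minimal sketch of how an interpreter session could set this up at startup (the directory layout, the ZEPPELIN_HOME variable, and keying on the OS user are illustrative assumptions, not what 208 or 702 currently do):

    # Sketch: one package library per notebook user, plus a shared,
    # interpreter-owned library. Keyed on the OS user purely for
    # illustration; a real server would key on the Zeppelin login.
    user       <- Sys.info()[["user"]]
    zeppelin_r <- file.path(Sys.getenv("ZEPPELIN_HOME", "."), "R")
    shared_lib <- file.path(zeppelin_r, "lib")
    user_lib   <- file.path(zeppelin_r, "users", user)
    dir.create(user_lib, recursive = TRUE, showWarnings = FALSE)

    # Per-user library first, shared interpreter library second, then the
    # defaults. Users can install what they like into their own library
    # without touching the interpreter's copies or each other's packages.
    .libPaths(c(user_lib, shared_lib, .libPaths()))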
>> R Interpreter functionality and security can mature over time via small iterative improvements and collaboration.
>>
>> Cheers, Jeff
>>
>>> On 3/9/16, 10:06 AM, "Amos B. Elberg" <amos.elb...@gmail.com> wrote:
>>>
>>> Eric, denying what happened is just silly. There's a whole record of it.
>>>
>>> When you started, your code downloaded the latest rscala. Your code broke entirely when rscala was updated. You scrambled to change it, and at the same time you started bundling the binary so you could control the version of the rscala jar, which you mistakenly thought would prevent it from breaking again. Then you found out (from me) about the R library issue, so you started letting the user pick the rscala version at build time. That still doesn't fix it.
>>>
>>> Now you say that implementing your solution requires an administrator to lock down the machine running Zeppelin, and so on -- but to what benefit? There's no way to take advantage of new rscala features anyway.
>>>
>>> Regarding pyspark, we have to maintain version parity with Spark. Python also doesn't have R's library management system.
>>>
>>> Why are you still defending this? It was a poor design decision. Move on.
>>>
>>>> On Mar 9, 2016, at 1:05 AM, Eric Charles <e...@apache.org> wrote:
>>>>
>>>>> On 09/03/16 06:41, Amos B. Elberg wrote:
>>>>> That's not true, Eric. When rscala was updated to 1.0.8, your interpreter broke entirely. You then rushed to fix it, and that's when you began including the binary in the distribution. This is all in the commit logs.
>>>>
>>>> I always shipped or downloaded the rscala jar...
>>>>
>>>>> I recognize that you now allow the user to select an rscala version at build time, which means they have to compile Zeppelin for a specific rscala version.
>>>>
>>>> The user does not have to choose. The packager and devops will do it for the user, and the user should not have permission to install/update any package.
>>>>
>>>>> What's the point? What you've achieved is to replace 3 short source files, while introducing a proven instability, a maintenance burden on the user, and a support burden on us when 200 people show up with obscure error messages that have to be diagnosed.
>>>>
>>>> Let me take the analogy of the py4j jar, which fulfills a similar role to the rscala jar: the role of binder between Scala and another language.
>>>>
>>>> With Spark 1.6, the pyspark version was updated. If Zeppelin had shipped the source code, the Zeppelin developers would have had the responsibility to update the complete source code of pyspark.
>>>>
>>>> In our case (relying on an external jar), upgrading from 0.8.2.1 to 0.9 was easy:
>>>>
>>>> https://github.com/apache/incubator-zeppelin/pull/463/files#diff-dbda0c4083ad9c59ff05f0273b5e760fR320
>>>>
>>>> That approach also has the enormous advantage of supporting not only different rscala versions but also different Scala profiles (2.10, 2.11...).
>>>>
>>>>>> On Mar 9, 2016, at 12:22 AM, Eric Charles <e...@apache.org> wrote:
>>>>>>
>>>>>>> On 09/03/16 06:05, Amos B. Elberg wrote:
>>>>>>> Jeff, you're correct that when Zeppelin is being professionally administered, the administrator can take care of all of this.
>>>>>>>
>>>>>>> But why create an additional system administration task? And what about users without professional systems admins?
>>>>>>>
>>>>>>> The only "benefit" to doing it that way is that we bundle a binary instead of source, when the source is likely to never need updating. That doesn't seem like a "benefit" at all.
>>>>>>>
>>>>>>> And in exchange for that, the cost is things like having to lock down R or prevent package updates? That doesn't make much sense to me.
>>>>>>>
>>>>>>> The question of which method has more overhead has been answered empirically: this issue already broke 702, and there have been a whole series of revisions to it to address various issues, with no end in sight.
>>>>>>
>>>>>> Amos, you mention a few times that 'the issue broke 702...'. I don't see when and why that particular approach broke anything.
>>>>>>
>>>>>> I certainly experimented and reported that the rscala version alignment is important, like the alignment of any Linux, JDK, Scala, Spark... dependency version.
>>>>>>
>>>>>> What I changed, after Moon's comment, is to download the jar at build time instead of shipping the jar in the source tree. This approach has the advantage that you can define at build time which version you want.
>>>>>>
>>>>>>> Meanwhile, this part of 208 has been stable for six months, without a single user issue.
>>>>>>>
>>>>>>>> On Mar 8, 2016, at 8:34 PM, Jeff Steinmetz <jeffrey.steinm...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> During my tests, I found the rScala installation and its dependencies to be manageable, even though it may not be ideal (i.e., the source is not included).
>>>>>>>> The Zeppelin build already needs to target the correct version of Spark + Hadoop, so for this exercise I treat rScala similarly, as an additional build and install consideration.
>>>>>>>>
>>>>>>>> I'm looking at this through a specific lens: "As a technology decision maker in a company, could I and would I deploy this in our environment? Could I work with our Data Science team to implement its usage? Would the Data Engineering and Data Science teams that commonly use R find it useful?"
>>>>>>>>
>>>>>>>> Inadvertent rScala updates could be minimized with education, letting users know that R package management within a notebook should be avoided. That generally seems like a good idea regardless of how R-Zeppelin is implemented, since it's a shared environment (you don't want to break other users' graphs with an rGraph update, or uninstall ggplot2, etc.).
>>>>>>>>
>>>>>>>> Even better: what if there were a Zeppelin config that disabled `install.packages()`, `remove.packages()`, and `update.packages()`? This would allow package installation to be carried out only by administrators or devops outside of Zeppelin.
>>>>>>>> Although the effort vs. benefit isn't clear, I'm sure somebody crafty could come up with a way around it with a convoluted eval or by running something through the shell in Zeppelin.
>>>>>>>>
>>>>>>>> R, Python, and Scala all leave a pretty wide-open door to parts of the underlying operating system.
>>>>>>>> A 100% bullet-proof way of locking "everything" down is a tough challenge.
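On the config idea above for disabling package management from within a notebook: here is a minimal sketch of the kind of guard rail that could be, assuming the interpreter controls the R session it launches (the function name and message are illustrative; no such Zeppelin config exists today):

    # Sketch: shadow the package-management functions in the R session the
    # interpreter starts, so ordinary notebook code cannot call them.
    # As noted above, this is a guard rail, not a security boundary: a
    # determined user can still call utils::install.packages() directly,
    # eval around the shadowing, or shell out.
    disable_pkg_management <- function() {
      blocked <- function(...) {
        stop("package management is disabled in this Zeppelin session")
      }
      for (fn in c("install.packages", "remove.packages", "update.packages")) {
        assign(fn, blocked, envir = globalenv())
      }
    }
    disable_pkg_management()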
>>>>>>>> >>>>>>>> ---- >>>>>>>> Jeff Steinmetz >>>>>>>> Principal Architect >>>>>>>> Akili Interactive >>>>>>>> www.akiliinteractive.com <http://www.akiliinteractive.com/> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> On 3/8/16, 12:16 PM, "Amos B. Elberg" <amos.elb...@gmail.com> wrote: >>>>>>>>> >>>>>>>>> Jeff - one of the problems with the rscala approach in 702 is it >>>>>>>>> doesn't take into account the R library. If rscala gets updated, the >>>>>>>>> user will likely download and update it automatically when they call >>>>>>>>> update.packages(). The result will be that the version of the rscala >>>>>>>>> R package doesn't match the version of the rscala jar, and the >>>>>>>>> interpreter will fail. Or, if the jar is also updated, it will simply >>>>>>>>> break 702. >>>>>>>>> >>>>>>>>> This has happened to 702 already-702 changed its library management >>>>>>>>> because an update to rscala broke the prior version. Actually though, >>>>>>>>> every rscala update is going to break 702. >>>>>>>>> >>>>>>>>> Regarding return values from the interpreter, the norm in R is that >>>>>>>>> the return value of the last expression is shown in a native format. >>>>>>>>> So, if the result is a data frame, a dataframe should be visualized. >>>>>>>>> If the result is an html object, it should be interpreted as html by >>>>>>>>> the browser. Do you disagree? >>>>>>>>> >>>>>>>>>> On Mar 8, 2016, at 2:52 PM, Jeff Steinmetz >>>>>>>>>> <jeffrey.steinm...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>> RE 702: >>>>>>>>>> I wanted to respond to Eric’s discussion from December 30. >>>>>>>>>> >>>>>>>>>> I finally had some time to put aside a good chunk of dedicated, >>>>>>>>>> uninterrupted time. >>>>>>>>>> This means I had a chance to “really” dig into this with a Data >>>>>>>>>> Science R developer hat on. >>>>>>>>>> I also thought about this from a DevOps point of view (deploying in >>>>>>>>>> an EC2 cluster, standalone, locally, VM). >>>>>>>>>> I tested it with a spark installation outside of the zeppelin build >>>>>>>>>> - as if it was running on a cluster or standalone install. >>>>>>>>>> >>>>>>>>>> I also had a chance to dig under the hood a bit, and explore what >>>>>>>>>> the Java/Scala code in PR 702 is doing. >>>>>>>>>> >>>>>>>>>> I like the simplicity of this PR (the source code and approach). >>>>>>>>>> >>>>>>>>>> Works as expected, all graphic works, interactive charts works. >>>>>>>>>> >>>>>>>>>> I also see your point about Rendering the text result vs TABLE plot >>>>>>>>>> when the R interpreter result is a data frame. >>>>>>>>>> To confirm - the approach is to use %sql to display it in a native >>>>>>>>>> Zeppelin visualization. >>>>>>>>>> >>>>>>>>>> Your approach makes sense, since this in line with how this works in >>>>>>>>>> other Zeppelin work flows. >>>>>>>>>> I suppose you could add an R interpreter function, such as: >>>>>>>>>> z.R.showDFAsTable(fooDF) if we wanted to force the data frame into a >>>>>>>>>> %table without having to jump to %sql (perhaps a nice addition in >>>>>>>>>> this or a future PR). >>>>>>>>>> >>>>>>>>>> It’s GREAT that %r print('%html') works with the Zeppelin display >>>>>>>>>> system! (as well as the other display system methods) >>>>>>>>>> >>>>>>>>>> Regarding rscala jar. You have a profile that will allow us to sync >>>>>>>>>> up the version rscala, so that makes sense as well. >>>>>>>>>> This too worked as expected. 
>>>>>>>>>> I specifically installed rscala (as you describe in your docs) in the VM with:
>>>>>>>>>>
>>>>>>>>>> curl https://cran.r-project.org/src/contrib/Archive/rscala/rscala_1.0.6.tar.gz -o /tmp/rscala_1.0.6.tar.gz
>>>>>>>>>> R CMD INSTALL /tmp/rscala_1.0.6.tar.gz
>>>>>>>>>>
>>>>>>>>>> Installing rscala outside of the Zeppelin dependencies does seem to keep this PR simpler, and it reduces the licensing overhead required to get this PR through (based on comments I see from others).
>>>>>>>>>>
>>>>>>>>>> I would need to add the two rscala install lines above to PR #751 (I will add this today):
>>>>>>>>>> https://github.com/apache/incubator-zeppelin/pull/751
>>>>>>>>>>
>>>>>>>>>> Regarding the interpreters: just having %r as our first interpreter keyword makes sense. Loading knitr within the interpreter to enable rendering (versus having a %knitr interpreter specifically) seems to keep things simple.
>>>>>>>>>>
>>>>>>>>>> In summary: this looks good, since everything in your sample R notebook (as well as a few other tests I tried) worked for me using the VM script in PR #751.
>>>>>>>>>> The documentation also facilitated a smooth installation and allowed me to create a repeatable script that, when paired with the VM, worked as expected.
>>>>>>>>>>
>>>>>>>>>> ----
>>>>>>>>>> Jeff Steinmetz
>>>>>>>>>> Principal Architect
>>>>>>>>>> Akili Interactive
>>>>>>>>>> www.akiliinteractive.com
>>>>>>>>>>
>>>>>>>>>>> From: Eric Charles <e...@apache.org>
>>>>>>>>>>> Subject: [DISCUSS] PR #208 - R Interpreter for Zeppelin
>>>>>>>>>>> Date: Wed, 30 Dec 2015 14:04:33 GMT
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I had a look at https://github.com/apache/incubator-zeppelin/pull/208 (and the related GitHub repo https://github.com/elbamos/Zeppelin-With-R [1]).
>>>>>>>>>>>
>>>>>>>>>>> Here are a few topics for discussion based on my experience developing https://github.com/datalayer/zeppelin-R [2].
>>>>>>>>>>>
>>>>>>>>>>> 1. rscala jar not in Maven Repository
>>>>>>>>>>>
>>>>>>>>>>> [1] copies the source (Scala and R) code from the rscala repo and changes/extends/repackages it a bit. [2] declares the jar as a system-scoped library. I recently had incompatibility issues between the 1.0.8 package (the one you get since 2015-12-10 when you install rscala in your R environment) and the 1.0.6 jar I am using as part of the zeppelin-R build.
>>>>>>>>>>> To avoid such issues, why not let the user choose the version via a property at build time, to fit the version they run on their host? This would also allow us to benefit from the next rscala releases, which fix bugs, bring new features... It also means we don't have to copy the rscala code into the Zeppelin tree.
>>>>>>>>>>>
>>>>>>>>>>> 2. Interpreters
>>>>>>>>>>>
>>>>>>>>>>> [1] proposes 2 interpreters, %sparkr.r and %sparkr.knitr, which are implemented in their own module apart from the Spark one. To be aligned with the existing pyspark implementation, why not integrate the R code into the Spark one?
>>>>>>>>>>> Is there any reason to keep 2 versions that do basically the same thing? The unique magic keyword would then be %spark.r.
>>>>>>>>>>>
>>>>>>>>>>> 3. Rendering a TABLE plot when the interpreter result is a data frame
>>>>>>>>>>>
>>>>>>>>>>> This may be confusing. What if I display a plot and simply want to print the first 10 rows at the end of my code? To keep the same behavior as the other interpreters, we could make this feature optional (disabled by default, enabled via a property).
>>>>>>>>>>>
>>>>>>>>>>> Thx, Eric
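On point 3 and the z.R.showDFAsTable() idea earlier in the thread: the Zeppelin display system already accepts printed output that starts with %table (tab-separated columns, newline-separated rows), so a helper like that could live entirely on the R side. A minimal sketch (the function name is hypothetical; only the %table output convention is Zeppelin's):

    # Hypothetical helper: print an R data frame through Zeppelin's display
    # system so it renders as a native table instead of plain text.
    show_df_as_table <- function(df) {
      header <- paste(colnames(df), collapse = "\t")
      rows   <- apply(df, 1, function(r) paste(r, collapse = "\t"))
      cat("%table ", paste(c(header, rows), collapse = "\n"), "\n", sep = "")
    }

    # Example usage inside a %r paragraph:
    show_df_as_table(head(iris, 10))

Whether something like this belongs in the interpreter itself (as z.R.showDFAsTable) or stays a user-level snippet is exactly the kind of follow-up PR question Jeff raises above.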