The package management thoughts I presented could be considered general suggestions for any R interpreter improvement, with implications beyond just rScala.
Researching options to lock down package management in the R notebook was a suggestion I raised, but I wouldn't consider it a show stopper for getting an R interpreter off the ground as a first step. That said, if we came up with a solid solution for tightening R package management in Zeppelin, there is no reason not to give it a try or discuss its utility. R interpreter functionality and security can mature over time via small iterative improvements and collaboration.

Cheers,
Jeff

On 3/9/16, 10:06 AM, "Amos B. Elberg" <amos.elb...@gmail.com> wrote:

> Eric, denying what happened is just silly. There's a whole record of it.
>
> When you started, your code downloaded the latest rscala. Your code broke entirely when rscala was updated. You scrambled to change it, and at the same time you started bundling the binary so you could control the version of the rscala jar, which you mistakenly thought would prevent it from breaking again. Then you found out (from me) about the R library issue, so you started letting the user pick the rscala version at build time. Which still doesn't fix it.
>
> Now, you say that implementing your solution requires an administrator to lock down the machine running Zeppelin, etc., but to what benefit? There's no way to take advantage of new rscala features anyway.
>
> Regarding pyspark, we have to maintain version parity with Spark. Python also doesn't have R's library management system.
>
> Why are you still defending this? It was a poor design decision. Move on.
>
>> On Mar 9, 2016, at 1:05 AM, Eric Charles <e...@apache.org> wrote:
>>
>>> On 09/03/16 06:41, Amos B. Elberg wrote:
>>> That's not true, Eric. When rscala was updated to 1.0.8, your interpreter broke entirely. You then rushed to fix it, and that's when you began including the binary in the distribution. This is all in the commit logs.
>>
>> I always shipped or downloaded the rscala jar...
>>
>>> I recognize that you now allow the user to select an rscala version at build time, which means they have to compile Zeppelin for a specific rscala version.
>>
>> The user doesn't have to choose. The packager and devops will do it for the user, and the user should not have permission to install or update any package.
>>
>>> What's the point? What you've achieved is to replace 3 short source files by introducing a proven instability, a maintenance burden on the user, and a support burden on us when 200 people show up with obscure error messages that have to be diagnosed.
>>
>> Let me take the analogy with the py4j jar, which fulfils a role similar to the rscala jar: the binder between Scala and another language.
>>
>> With Spark 1.6, the pyspark version was updated. If Zeppelin had shipped the source code, a Zeppelin developer would have had the responsibility to update the complete source code of pyspark.
>>
>> In our case (relying on an external jar), upgrading from 0.8.2.1 to 0.9 was easy:
>>
>> https://github.com/apache/incubator-zeppelin/pull/463/files#diff-dbda0c4083ad9c59ff05f0273b5e760fR320
>>
>> That approach also has the enormous advantage of supporting not only different rscala versions but also different Scala profiles (2.10, 2.11...).
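Both sides of this exchange keep returning to the same constraint: the rscala package installed in R and the rscala jar Zeppelin was built against have to match. Purely as an illustration (neither PR is claimed to do this, and the helper name and expected-version argument below are assumptions), a small check in the interpreter's R bootstrap could surface a mismatch before it turns into an obscure connection error:

    # Illustrative only: warn at interpreter start-up when the rscala R package
    # on the host does not match the rscala version Zeppelin was built against.
    # The expected version string would be injected by whoever pins the build.
    check_rscala_version <- function(expected = "1.0.6") {
      if (!requireNamespace("rscala", quietly = TRUE)) {
        stop("The 'rscala' package is not installed in this R environment.", call. = FALSE)
      }
      installed <- as.character(utils::packageVersion("rscala"))
      if (!identical(installed, expected)) {
        warning(sprintf("rscala R package is %s but Zeppelin expects %s; the R interpreter may fail to connect.",
                        installed, expected), call. = FALSE)
      }
      invisible(installed)
    }

    check_rscala_version("1.0.6")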
>>>> On Mar 9, 2016, at 12:22 AM, Eric Charles <e...@apache.org> wrote:
>>>>
>>>>> On 09/03/16 06:05, Amos B. Elberg wrote:
>>>>> Jeff, you're correct that when Zeppelin is being professionally administered, the administrator can take care of all of this.
>>>>>
>>>>> But why create an additional system administration task? And what about users without professional systems admins?
>>>>>
>>>>> The only "benefit" to doing it that way is that we bundle a binary instead of source, when the source is likely to never need updating. That doesn't seem like a "benefit" at all.
>>>>>
>>>>> And in exchange for that, the cost is things like having to lock down R or prevent package updates? That doesn't make much sense to me.
>>>>>
>>>>> The question of which method has more overhead has been answered empirically: this issue already broke 702, and there have been a whole series of revisions to it to address various issues, with no end in sight.
>>>>
>>>> Amos, you mention a few times that "the issue broke 702". I don't see when and why that particular approach broke anything.
>>>>
>>>> I certainly experimented and reported that the rscala version alignment is important, like any Linux, JDK, Scala, Spark... dependency version.
>>>>
>>>> What I changed, after Moon's comment, is to download the jar at build time instead of shipping the jar in the source tree. This approach has the advantage of letting you define at build time which version you want.
>>>>
>>>>> Meanwhile, this part of 208 has been stable for six months, without a single user issue.
>>>>>
>>>>>> On Mar 8, 2016, at 8:34 PM, Jeff Steinmetz <jeffrey.steinm...@gmail.com> wrote:
>>>>>>
>>>>>> During my tests, I found the rScala installation and its dependencies to be manageable, even though it may not be ideal (i.e. the source is not included). The Zeppelin build already needs to target the correct version of Spark + Hadoop, so for this exercise I treat rScala similarly, as an additional build and install consideration.
>>>>>>
>>>>>> I'm looking at this through a specific lens: "As a technology decision maker in a company, could I and would I deploy this in our environment? Could I work with our Data Scientist team to implement its usage? Would the Data Engineering and Data Science teams that commonly use R find it useful?"
>>>>>>
>>>>>> Inadvertent rScala updates could be minimized with education, letting users know that R package management within a notebook should be avoided. That generally seems like a good idea regardless of how R-Zeppelin is implemented, since it's a shared environment (you don't want to break other users' graphs with an rGraph update, or uninstall ggplot2, etc.).
>>>>>>
>>>>>> Even better: what if there were a Zeppelin config that disabled `install.packages()`, `remove.packages()` and `update.packages()`? This would allow package installation to be carried out only by administrators or devops outside of Zeppelin. The effort vs. benefit isn't clear, though, and I'm sure somebody crafty could find a way around it with a convoluted eval or by running something through the shell in Zeppelin.
>>>>>>
>>>>>> R, Python and Scala all have a pretty wide-open door to parts of the underlying operating system. A 100% bulletproof way of locking "everything" down is a tough challenge.
>>>>>>
>>>>>> ----
>>>>>> Jeff Steinmetz
>>>>>> Principal Architect
>>>>>> Akili Interactive
>>>>>> www.akiliinteractive.com
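As a concrete sketch of the lockdown Jeff describes above (nothing like this exists in either PR; the hook point and the helper name are assumptions), the interpreter's R bootstrap could mask the package-management functions so that a notebook cell calling them gets a clear error instead of silently changing the shared library:

    # Hypothetical sketch, run when the interpreter opens its R session.
    disable_package_management <- function() {
      make_stub <- function(name) {
        force(name)
        function(...) stop(name, "() is disabled in this shared Zeppelin environment; ",
                           "ask an administrator to change installed packages.", call. = FALSE)
      }
      for (fn in c("install.packages", "remove.packages", "update.packages")) {
        assign(fn, make_stub(fn), envir = globalenv())  # mask the utils function for notebook code
        lockBinding(fn, globalenv())                    # block trivial re-assignment from a cell
      }
    }
    disable_package_management()

As Jeff notes, this is not bulletproof: utils::install.packages(), a convoluted eval, or a shell escape from the notebook would still get around it.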
Elberg" <amos.elb...@gmail.com> wrote: >>>>>>> >>>>>>> Jeff - one of the problems with the rscala approach in 702 is it >>>>>>> doesn't take into account the R library. If rscala gets updated, the >>>>>>> user will likely download and update it automatically when they call >>>>>>> update.packages(). The result will be that the version of the rscala R >>>>>>> package doesn't match the version of the rscala jar, and the >>>>>>> interpreter will fail. Or, if the jar is also updated, it will simply >>>>>>> break 702. >>>>>>> >>>>>>> This has happened to 702 already-702 changed its library management >>>>>>> because an update to rscala broke the prior version. Actually though, >>>>>>> every rscala update is going to break 702. >>>>>>> >>>>>>> Regarding return values from the interpreter, the norm in R is that the >>>>>>> return value of the last expression is shown in a native format. So, if >>>>>>> the result is a data frame, a dataframe should be visualized. If the >>>>>>> result is an html object, it should be interpreted as html by the >>>>>>> browser. Do you disagree? >>>>>>> >>>>>>>> On Mar 8, 2016, at 2:52 PM, Jeff Steinmetz >>>>>>>> <jeffrey.steinm...@gmail.com> wrote: >>>>>>>> >>>>>>>> RE 702: >>>>>>>> I wanted to respond to Eric’s discussion from December 30. >>>>>>>> >>>>>>>> I finally had some time to put aside a good chunk of dedicated, >>>>>>>> uninterrupted time. >>>>>>>> This means I had a chance to “really” dig into this with a Data >>>>>>>> Science R developer hat on. >>>>>>>> I also thought about this from a DevOps point of view (deploying in an >>>>>>>> EC2 cluster, standalone, locally, VM). >>>>>>>> I tested it with a spark installation outside of the zeppelin build - >>>>>>>> as if it was running on a cluster or standalone install. >>>>>>>> >>>>>>>> I also had a chance to dig under the hood a bit, and explore what the >>>>>>>> Java/Scala code in PR 702 is doing. >>>>>>>> >>>>>>>> I like the simplicity of this PR (the source code and approach). >>>>>>>> >>>>>>>> Works as expected, all graphic works, interactive charts works. >>>>>>>> >>>>>>>> I also see your point about Rendering the text result vs TABLE plot >>>>>>>> when the R interpreter result is a data frame. >>>>>>>> To confirm - the approach is to use %sql to display it in a native >>>>>>>> Zeppelin visualization. >>>>>>>> >>>>>>>> Your approach makes sense, since this in line with how this works in >>>>>>>> other Zeppelin work flows. >>>>>>>> I suppose you could add an R interpreter function, such as: >>>>>>>> z.R.showDFAsTable(fooDF) if we wanted to force the data frame into a >>>>>>>> %table without having to jump to %sql (perhaps a nice addition in this >>>>>>>> or a future PR). >>>>>>>> >>>>>>>> It’s GREAT that %r print('%html') works with the Zeppelin display >>>>>>>> system! (as well as the other display system methods) >>>>>>>> >>>>>>>> Regarding rscala jar. You have a profile that will allow us to sync >>>>>>>> up the version rscala, so that makes sense as well. >>>>>>>> This too worked as expected. 
>>>>>>>> Regarding the rscala jar: you have a profile that will allow us to sync up the rscala version, so that makes sense as well. This too worked as expected. I specifically installed rscala (as you describe in your docs) in the VM with:
>>>>>>>>
>>>>>>>> curl https://cran.r-project.org/src/contrib/Archive/rscala/rscala_1.0.6.tar.gz -o /tmp/rscala_1.0.6.tar.gz
>>>>>>>> R CMD INSTALL /tmp/rscala_1.0.6.tar.gz
>>>>>>>>
>>>>>>>> Installing rscala outside of the Zeppelin dependencies does seem to keep this PR simpler, and it reduces the licensing overhead required to get this PR through (based on comments I see from others).
>>>>>>>>
>>>>>>>> I would need to add the two rscala install lines above to PR#751 (I will add this today): https://github.com/apache/incubator-zeppelin/pull/751
>>>>>>>>
>>>>>>>> Regarding the interpreters: just having %r as our first interpreter keyword makes sense. Loading knitr within the interpreter to enable rendering (versus having a dedicated %knitr interpreter) seems to keep things simple.
>>>>>>>>
>>>>>>>> In summary: looks good, since everything in your sample R notebook (as well as a few other tests I tried) worked for me using the VM script in PR#751. The documentation also facilitated a smooth installation and allowed me to create a repeatable script that, when paired with the VM, worked as expected.
>>>>>>>>
>>>>>>>> ----
>>>>>>>> Jeff Steinmetz
>>>>>>>> Principal Architect
>>>>>>>> Akili Interactive
>>>>>>>> www.akiliinteractive.com
>>>>>>>>
>>>>>>>>> From: Eric Charles <e...@apache.org>
>>>>>>>>> Subject: [DISCUSS] PR #208 - R Interpreter for Zeppelin
>>>>>>>>> Date: Wed, 30 Dec 2015 14:04:33 GMT
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I had a look at https://github.com/apache/incubator-zeppelin/pull/208 (and the related GitHub repo https://github.com/elbamos/Zeppelin-With-R [1]).
>>>>>>>>>
>>>>>>>>> Here are a few topics for discussion based on my experience developing https://github.com/datalayer/zeppelin-R [2].
>>>>>>>>>
>>>>>>>>> 1. rscala jar not in Maven Repository
>>>>>>>>>
>>>>>>>>> [1] copies the source (Scala and R) code from the rscala repo and changes/extends/repackages it a bit. [2] declares the jar as a system-scoped library. I recently had incompatibility issues between 1.0.8 (the version you get since 2015-12-10 when you install rscala in your R environment) and the 1.0.6 jar I am using as part of the zeppelin-R build. To avoid such issues, why not let the user choose the version via a property at build time, to fit the version running on the host? This would also let us benefit from future rscala releases, which fix bugs, bring new features... It also means we don't have to copy the rscala code into the Zeppelin tree.
>>>>>>>>>
>>>>>>>>> 2. Interpreters
>>>>>>>>>
>>>>>>>>> [1] proposes two interpreters, %sparkr.r and %sparkr.knitr, implemented in their own module apart from the Spark one. To align with the existing pyspark implementation, why not integrate the R code into the Spark module? Is there any reason to keep two versions which do basically the same thing? The single magic keyword would then be %spark.r.
>>>>>>>>>
>>>>>>>>> 3. Rendering a TABLE plot when the interpreter result is a data frame
>>>>>>>>>
>>>>>>>>> This may be confusing. What if I display a plot and simply want to print the first 10 rows at the end of my code? To keep the same behavior as the other interpreters, we could make this feature optional (disabled by default, enabled via a property).
>>>>>>>>>
>>>>>>>>> Thx, Eric