Two more points off the top of my head:

1.- Shipping the rscala source/package will not eliminate the risk of users breaking the deployment. In other words, even if you ship the package with Zeppelin, you still need to prevent users from overriding it with conflicting rscala versions...

2.- We also have the recent "Dedicated interpreter session per notebook" change, https://github.com/apache/incubator-zeppelin/pull/703. For now, this is not really addressed. I will look at the impact in the case of separate sessions...



On 09/03/16 05:49, Eric Charles wrote:
Jeff, Thx for taking the time to test and give feedback.

Regarding the risk of an rscala update by a user, a few ideas/questions:

1.- I wonder if Zeppelin could be run as a user with restricted rights
(install.packages() downloads the packages into a folder that could be
created without write permission - also, the compilation result is
written under /usr/..., which should not be writable by the zeppelin
user).

2.- You could simply uninstall the compilation libraries from the host
so that packages cannot be built and installed.

3.- Obviously, as Jeff proposes, reject any R code containing
install.*() calls (not sure whether this should be configurable, since
it would be easy to re-enable) -> I take it as a todo... A sketch of
what that check could look like follows.
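
A minimal sketch of such a guard, which the interpreter could run on
each paragraph before evaluating it (the function name and the wiring
into the interpreter are hypothetical, not part of PR 702):

rejectPackageManagement <- function(code) {
  # Crude pattern match on install.packages()/remove.packages()/update.packages()
  if (grepl("\\b(install|remove|update)\\.packages\\s*\\(", code, perl = TRUE)) {
    stop("Package management calls are disabled in this Zeppelin deployment.")
  }
  invisible(TRUE)
}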

These sound to me like the classical considerations when deploying a
system into an environment (e.g. user and file system rights, no %sh
interpreter...)

For the dataframe rendering: in Scala, as in R, the type of the last
expression is considered. For consistency with the other interpreters
(%spark...), I was thinking that the R interpreter should not take a
decision on this. But if users think otherwise, the behavior can of
course be adapted. I also take Jeff's suggestion
(z.R.showDFAsTable(fooDF)) as a todo; a sketch follows.
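
To make that suggestion concrete, here is a rough sketch of such a
helper emitting Zeppelin's %table display format (the name and
signature come from Jeff's mail; the body is my own assumption, not
code from the PR):

showDFAsTable <- function(df, max.rows = 1000) {
  df <- head(df, max.rows)
  # The display system renders tab-separated output prefixed with %table
  cat("%table ", paste(colnames(df), collapse = "\t"), "\n", sep = "")
  cat(apply(df, 1, paste, collapse = "\t"), sep = "\n")
}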



On 09/03/16 02:34, Jeff Steinmetz wrote:
During my tests, I found the rScala installation and its dependencies
to be manageable - even though it may not be ideal (i.e. the source is
not included).
The Zeppelin build already needs to target the correct version of
Spark + Hadoop, so for this exercise I treat rScala similarly, as an
additional build and install consideration.

I’m looking at this through a specific lens:
“As a technology decision maker in a company, could I and would I
deploy this in our environment? Could I work with our Data Science
team to implement its usage? Would the Data Engineering and Data
Science teams that commonly use R find it useful?”

Inadvertent rScala updates could be minimized with education, letting
users know that R package management within a notebook should be
avoided. That generally seems like a good idea regardless of how
R-Zeppelin is implemented, since it’s a shared environment (you don’t
want to break other users’ graphs with an rGraph update, or uninstall
ggplot2, etc.)

Even better - what if there were a Zeppelin config option that disabled
`install.packages()`, `remove.packages()` and `update.packages()`?
This would allow package installation to be carried out only by
administrators or devops outside of Zeppelin.
The effort vs. benefit isn’t clear, though - I’m sure somebody crafty
could come up with a way around it with a convoluted eval or by
running something through the shell in Zeppelin.
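
For what it’s worth, a sketch of how such an option could be enforced
on the R side by rebinding the functions on the search path (the
approach is my assumption, not anything in the PR; and per the caveat
above, utils:::install.packages or a shell escape would still get
around it):

disablePackageManagement <- function(name) {
  env <- as.environment("package:utils")
  unlockBinding(name, env)
  assign(name,
         function(...) stop(name, "() is disabled in this Zeppelin deployment.",
                            call. = FALSE),
         envir = env)
  lockBinding(name, env)
}
invisible(lapply(c("install.packages", "remove.packages", "update.packages"),
                 disablePackageManagement))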

R, Python and Scala all have a pretty wide-open door to parts of the
underlying operating system.
A 100% bulletproof way of locking "everything" down is a tough
challenge.

----
Jeff Steinmetz
Principal Architect
Akili Interactive
www.akiliinteractive.com






On 3/8/16, 12:16 PM, "Amos B. Elberg" <amos.elb...@gmail.com> wrote:

Jeff - one of the problems with the rscala approach in 702 is that it
doesn't take the R library into account. If rscala gets updated, the
user will likely download and update it automatically when they call
update.packages(). The result will be that the version of the rscala
R package doesn't match the version of the rscala jar, and the
interpreter will fail. Or, if the jar is also updated, it will simply
break 702.

This has happened to 702 already - 702 changed its library management
because an update to rscala broke the prior version. Actually, though,
every rscala update is going to break 702.
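
One way the interpreter could at least fail fast on that mismatch, a
hypothetical startup check (the expected version string would have to
come from the build):

expected <- "1.0.6"  # version of the rscala jar bundled with the Zeppelin build
installed <- as.character(packageVersion("rscala"))
if (installed != expected) {
  stop("rscala R package ", installed, " does not match the bundled jar ", expected)
}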

Regarding return values from the interpreter, the norm in R is that
the return value of the last expression is shown in a native format.
So, if the result is a data frame, a dataframe should be visualized.
If the result is an html object, it should be interpreted as html by
the browser.  Do you disagree?
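
For anyone less familiar with R, a trivial illustration of that norm
in a plain R session:

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
df   # the last top-level expression auto-prints in R's native format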

On Mar 8, 2016, at 2:52 PM, Jeff Steinmetz
<jeffrey.steinm...@gmail.com> wrote:

RE 702:
I wanted to respond to Eric’s discussion from December 30.

I finally managed to set aside a good chunk of dedicated,
uninterrupted time.
This means I had a chance to “really” dig into this with a Data
Science R developer hat on.
I also thought about this from a DevOps point of view (deploying in
an EC2 cluster, standalone, locally, VM).
I tested it with a spark installation outside of the zeppelin build
- as if it was running on a cluster or standalone install.

I also had a chance to dig under the hood a bit, and explore what
the Java/Scala code in PR 702 is doing.

I like the simplicity of this PR (the source code and approach).

Works as expected; all graphics work, and the interactive charts work.

I also see your point about rendering the text result vs. a TABLE plot
when the R interpreter result is a data frame.
To confirm: the approach is to use %sql to display it in a native
Zeppelin visualization.

Your approach makes sense, since it is in line with how this works in
other Zeppelin workflows.
I suppose you could add an R interpreter function, such as
z.R.showDFAsTable(fooDF), if we wanted to force the data frame into a
%table without having to jump to %sql (perhaps a nice addition in
this or a future PR).

It’s GREAT that %r print('%html') works with the Zeppelin display
system!  (as well as the other display system methods)
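
For example, a paragraph like this renders as markup rather than plain
text (the HTML here is just an illustration):

%r
print("%html <h3>Rendered by the browser</h3>")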

Regarding the rscala jar: you have a profile that allows us to sync up
the rscala version, so that makes sense as well.
This too worked as expected. I specifically installed rscala (as
you describe in your docs) in the VM with:

curl https://cran.r-project.org/src/contrib/Archive/rscala/rscala_1.0.6.tar.gz -o /tmp/rscala_1.0.6.tar.gz
R CMD INSTALL /tmp/rscala_1.0.6.tar.gz


Installing rscala outside of the Zeppelin dependencies does seem to
keep this PR simpler, and it reduces the licensing overhead required
to get this PR through (based on comments I see from others).

I would need to add the two rscala install lines above to PR#751 (I
will add this today)
https://github.com/apache/incubator-zeppelin/pull/751


Regarding the interpreters: just having %r as our first interpreter
keyword makes sense. Loading knitr within the interpreter to enable
rendering (versus having a dedicated %knitr interpreter) seems to
keep things simple.
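
Roughly, "loading knitr within the interpreter" amounts to something
like the following (the exact function and options are my assumption,
not necessarily what the PR does):

library(knitr)
# Knit the paragraph to an HTML fragment; knit2html() returns the markup
# when given text input, which can then go through the display system.
html <- knit2html(text = "The answer is `r 21 * 2`.", fragment.only = TRUE)
print(paste0("%html ", paste(html, collapse = "\n")))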

In summary: looks good, since everything in your sample R notebook
(as well as a few other tests I tried) worked for me using the VM
script in PR#751.
The documentation also facilitated a smooth installation and allowed
me to create a repeatable script that, when paired with the VM,
worked as expected.

----
Jeff Steinmetz
Principal Architect
Akili Interactive
www.akiliinteractive.com







From: Eric Charles <e...@apache.org>
Subject: [DISCUSS] PR #208 - R Interpreter for Zeppelin
Date: Wed, 30 Dec 2015 14:04:33 GMT


Hi,

I had a look at https://github.com/apache/incubator-zeppelin/pull/208
(and related Github repo https://github.com/elbamos/Zeppelin-With-R
[1])

Here are a few topics for discussion based on my experience developing
https://github.com/datalayer/zeppelin-R [2].

1. rscala jar not in Maven Repository

[1] copies the source (Scala and R) code from the rscala repo and
changes/extends/repackages it a bit. [2] declares the jar as a
system-scoped library. I recently had incompatibility issues between
the 1.0.8 version (the one you get since 2015-12-10 when you install
rscala in your R environment) and the 1.0.6 jar I use as part of the
zeppelin-R build.
To avoid such issues, why not let the user choose the version via a
property at build time, to match the version he runs on his host? This
would also let us benefit from upcoming rscala releases, which fix
bugs, bring new features... It also means we don't have to copy the
rscala code into the Zeppelin tree.

2. Interpreters

[1] proposes 2 interpreters, %sparkr.r and %sparkr.knitr, which are
implemented in their own module apart from the Spark one. To be
aligned with the existing pyspark implementation, why not integrate
the R code into the Spark one? Any reason to keep 2 versions which do
basically the same thing? The unique magic keyword would then be
%spark.r

3. Rendering TABLE plot when interpreter result is a dataframe

This may be confusing. What if I display a plot and simply want to
print the first 10 rows at the end of my code? To keep the same
behavior as the other interpreters, we could make this feature
optional (disabled by default, enabled via a property).


Thx, Eric

