Jeff, thanks for taking the time to test and give feedback.
Regarding the risk of an rscala update by a user, a few ideas/questions:
1.- I wonder if Zeppelin could be run as a user with restricted rights
(install.packages downloads the packages into a folder, which could be
created without write permission; also, the compilation result is written
in /usr/... which should not be writable by the zeppelin user).
2.- You could simply uninstall the compilation libraries from the host
so the packages cannot be installed.
3.- Obviously, as Jeff proposes, reject any R code that calls install.*()
(not sure this should be driven by configuration, as it would then be easy
to re-enable) -> I'll take it as a todo...
These sound to me like the classic considerations when deploying a system
in an environment (e.g. user and file system rights, no %sh interpreter...).
For the dataframe rendering, Scala, like R, looks at the type of the last
expression. For consistency with the other interpreters (%spark...), I was
thinking that the R interpreter should not take a decision on this. But if
users think otherwise, the behavior can of course be adapted. I'll also
take Jeff's suggestion (z.R.showDFAsTable(fooDF)).
On 09/03/16 02:34, Jeff Steinmetz wrote:
During my tests, I found the rScala installation and its dependencies to be
manageable - even though it may not be ideal (i.e. the source is not included).
The Zeppelin build already needs to target the correct version of Spark +
Hadoop so for this exercise, I treat rScala similarly, as an additional build
and install consideration.
I’m looking at this through a specific lens:
“As a technology decision maker in a company, could I and would I deploy this
in our environment? Could I work with our Data Scientist team to implement its
usage? Would the Data Engineering and Data Science team that commonly use R
find it useful?”
Inadvertent rScala updates could be minimized with education, letting the users
know that R package management within a notebook should be avoided. That
generally seems like a good idea regardless of how R-Zeppelin is implemented,
since it’s a shared environment (you don’t want to break other users’ graphs
with an rGraph update or by uninstalling ggplot2, etc.).
Even better - what if there was a zeppelin config that disabled
`install.packages()`, `remove.packages()` and `update.packages()`. This would
allow package installation to be carried out only by administrators or devops
outside of Zeppelin.
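As a rough illustration of the kind of check such a config could switch on, here is a minimal Python sketch that scans a paragraph of R code for the three functions named above. It is not Zeppelin code: `BLOCKED_R_FUNCTIONS` and `find_blocked_calls` are hypothetical names, and a real implementation would live in the interpreter itself.

```python
import re

# The three functions named in the proposal. In practice this list would
# come from a (hypothetical) Zeppelin interpreter property.
BLOCKED_R_FUNCTIONS = ("install.packages", "remove.packages", "update.packages")

# Match a blocked name followed by "(", optionally prefixed by "utils::".
# The lookbehind avoids matching inside longer identifiers such as
# "my_install.packages(".
_BLOCKED_CALL = re.compile(
    r"(?<![\w.])(?:utils:::?)?("
    + "|".join(re.escape(name) for name in BLOCKED_R_FUNCTIONS)
    + r")\s*\("
)


def find_blocked_calls(r_source):
    """Return the sorted, de-duplicated blocked function names called in r_source."""
    return sorted(set(_BLOCKED_CALL.findall(r_source)))


def is_allowed(r_source):
    """True when the paragraph contains no direct call to a blocked function."""
    return not find_blocked_calls(r_source)
```

A purely textual check like this only catches direct calls. It also flags calls that appear inside R comments or strings, and it can be sidestepped with indirect invocation such as `do.call("install.packages", ...)`, so it would complement file system restrictions rather than replace them.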
Although the effort vs. benefit isn’t clear, I’m sure somebody crafty
could come up with a way around this with a convoluted Eval or by running
something through the shell in Zeppelin.
R, Python and Scala all have a pretty wide-open door to parts of the underlying
operating system.
A 100% bulletproof way of locking “everything” down is a tough challenge.
----
Jeff Steinmetz
Principal Architect
Akili Interactive
www.akiliinteractive.com <http://www.akiliinteractive.com/>
On 3/8/16, 12:16 PM, "Amos B. Elberg" <amos.elb...@gmail.com> wrote:
Jeff - one of the problems with the rscala approach in 702 is it doesn't take
into account the R library. If rscala gets updated, the user will likely
download and update it automatically when they call update.packages(). The
result will be that the version of the rscala R package doesn't match the
version of the rscala jar, and the interpreter will fail. Or, if the jar is
also updated, it will simply break 702.
This has happened to 702 already: 702 changed its library management because an
update to rscala broke the prior version. Actually though, every rscala update
is going to break 702.
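The mismatch described here could at least be detected before the interpreter starts. The Python sketch below is an assumption-laden illustration, not part of 702: it reads the `Version:` field that every installed R package carries in its DESCRIPTION file and compares it with the version of the jar the build was made against. The function names are hypothetical, and how the jar version is obtained is left to the caller.

```python
import re


def description_version(description_text):
    """Extract the 'Version:' field from the text of an R package DESCRIPTION file."""
    match = re.search(r"^Version:\s*(\S+)\s*$", description_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def rscala_versions_match(description_text, jar_version):
    """True only when the installed R package version equals the jar version."""
    return description_version(description_text) == jar_version


# Example: the R package was updated to 1.0.8 while the build still uses 1.0.6.
installed = "Package: rscala\nVersion: 1.0.8\nLicense: GPL\n"
if not rscala_versions_match(installed, "1.0.6"):
    print("rscala R package and jar versions differ; the interpreter may fail")
```

A check like this does not prevent the breakage; it only turns an obscure interpreter failure into an explicit message.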
Regarding return values from the interpreter, the norm in R is that the return
value of the last expression is shown in a native format. So, if the result is
a data frame, a dataframe should be visualized. If the result is an html
object, it should be interpreted as html by the browser. Do you disagree?
On Mar 8, 2016, at 2:52 PM, Jeff Steinmetz <jeffrey.steinm...@gmail.com> wrote:
RE 702:
I wanted to respond to Eric’s discussion from December 30.
I finally managed to set aside a good chunk of dedicated, uninterrupted
time.
This means I had a chance to “really” dig into this with a Data Science R
developer hat on.
I also thought about this from a DevOps point of view (deploying in an EC2
cluster, standalone, locally, VM).
I tested it with a spark installation outside of the zeppelin build - as if it
was running on a cluster or standalone install.
I also had a chance to dig under the hood a bit, and explore what the
Java/Scala code in PR 702 is doing.
I like the simplicity of this PR (the source code and approach).
It works as expected: all graphics work, and interactive charts work.
I also see your point about Rendering the text result vs TABLE plot when the R
interpreter result is a data frame.
To confirm - the approach is to use %sql to display it in a native Zeppelin
visualization.
Your approach makes sense, since this is in line with how this works in other
Zeppelin workflows.
I suppose you could add an R interpreter function, such as:
z.R.showDFAsTable(fooDF) if we wanted to force the data frame into a %table
without having to jump to %sql (perhaps a nice addition in this or a future PR).
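For reference, the Zeppelin display system renders output as a table when it starts with `%table` and is followed by tab-separated columns and newline-separated rows, with the header as the first row. A helper like the suggested showDFAsTable would essentially have to produce that text. The Python sketch below shows only the formatting step; `to_zeppelin_table` is a hypothetical name, and the real function would be written in R or Scala inside the interpreter.

```python
def to_zeppelin_table(columns, rows):
    """Format column names and rows as Zeppelin '%table' display output."""

    def clean(value):
        # Tabs and newlines are the column and row separators, so they
        # must not appear inside a cell.
        return str(value).replace("\t", " ").replace("\n", " ")

    lines = ["\t".join(clean(name) for name in columns)]
    lines.extend("\t".join(clean(cell) for cell in row) for row in rows)
    return "%table " + "\n".join(lines)


# Example with a tiny two-column data set.
print(to_zeppelin_table(["name", "size"], [("sun", 100), ("moon", 10)]))
```

Keeping this as an explicit call leaves the default behavior (printing the text result) untouched, which matches the consistency argument made for the other interpreters.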
It’s GREAT that %r print('%html') works with the Zeppelin display system! (as
well as the other display system methods)
Regarding the rscala jar: you have a profile that will allow us to sync up the
rscala version, so that makes sense as well.
This too worked as expected. I specifically installed rscala (as you describe
in your docs) in the VM with:
curl https://cran.r-project.org/src/contrib/Archive/rscala/rscala_1.0.6.tar.gz -o /tmp/rscala_1.0.6.tar.gz
R CMD INSTALL /tmp/rscala_1.0.6.tar.gz
Installing rscala outside of the Zeppelin dependencies does seem to keep this
PR simpler, and reduces the licensing overhead required to get this PR through
(based on comments I see from others)
I would need to add the two rscala install lines above to PR#751 (I will add
this today)
https://github.com/apache/incubator-zeppelin/pull/751
Regarding the Interpreters. Just having %r as our first interpreter
keyword makes sense. Loading knitr within the interpreter to enable rendering
(versus having a %knitr interpreter specifically) seems to keep things simple.
In summary - Looks good since everything in your sample R notebook (as well as
a few other tests I tried) worked for me using the VM script in PR#751.
The documentation also facilitated a smooth installation and allowed me to
create a repeatable script, that when paired with the VM worked as expected.
----
Jeff Steinmetz
Principal Architect
Akili Interactive
www.akiliinteractive.com <http://www.akiliinteractive.com/>
From: Eric Charles <e...@apache.org>
Subject: [DISCUSS] PR #208 - R Interpreter for Zeppelin
Date: Wed, 30 Dec 2015 14:04:33 GMT
Hi,
I had a look at https://github.com/apache/incubator-zeppelin/pull/208
(and related Github repo https://github.com/elbamos/Zeppelin-With-R [1])
Here are a few topics for discussion based on my experience developing
https://github.com/datalayer/zeppelin-R [2].
1. rscala jar not in Maven Repository
[1] copies the source (scala and R) code from rscala repo and
changes/extends/repackages it a bit. [2] declares the jar as a system
scoped library. I recently had incompatibility issues between the 1.0.8
(the one you get since 2015-12-10 when you install rscala on your R
environment) and the 1.0.6 jar I am using as part of the zeppelin-R build.
To avoid such issues, why not let the user choose the version via a
property at build time, to match the version running on their host? This
will also allow us to benefit from the next rscala releases which fix bugs,
bring new features... This also means we don't have to copy the rscala
code into the Zeppelin tree.
2. Interpreters
[1] proposes 2 interpreters, %sparkr.r and %sparkr.knitr, which are
implemented in their own module apart from the Spark one. To align with
the existing pyspark implementation, why not integrate the R code into
the Spark one? Any reason to keep 2 versions which do basically the
same thing? The unique magic keyword would then be %spark.r
3. Rendering TABLE plot when interpreter result is a dataframe
This may be confusing. What if I display a plot and simply want to print
the first 10 rows at the end of my code? To keep the same behavior as
the other interpreters, we could make this feature optional (disabled by
default, enabled via property).
Thx, Eric