Right now in sparkR.R the backend hostname is hard-coded to "localhost" (https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L156).
If we make that address configurable / parameterized, then a user can connect to a remote Spark cluster with no need to have the Spark JARs on their local machine. I have gotten this request from some R users. Their company has a Spark cluster (usually managed by another team), and they want to connect to it from their workstations (e.g., from within RStudio, etc.).

--Hossein
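(As a rough illustration, making the host configurable could be as simple as reading it from the environment before opening the backend socket. The variable name SPARKR_BACKEND_HOST and the function shape below are assumptions for the sketch, not the actual SparkR API:)

    # Sketch only: read the backend host from an (assumed) environment variable
    # instead of hard-coding "localhost". Falls back to localhost when unset.
    connect_backend <- function(port, timeout = 6000) {
      host <- Sys.getenv("SPARKR_BACKEND_HOST", unset = "localhost")
      socketConnection(host = host, port = port, server = FALSE,
                       blocking = TRUE, open = "wb", timeout = timeout)
    }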
On Thu, Sep 24, 2015 at 12:25 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:

I don't think the crux of the problem is about users who download the source -- Spark's source distribution is clearly marked as something that needs to be built, and they can run `mvn -DskipTests -Psparkr package` based on instructions in the Spark docs.

The crux of the problem is that, with either a source or a binary R package, the SparkR code on the client side needs the Spark JARs to be available. So we can't just connect to a remote Spark cluster using only the R scripts, as we need the Scala classes around to create a Spark context etc.

But this is a use case that I've heard from a lot of users -- my take is that this should be a separate package / layer on top of SparkR. Dan Putler (cc'd) had a proposal for a client package for this and may be able to add more.

Thanks
Shivaram

On Thu, Sep 24, 2015 at 11:36 AM, Hossein <fal...@gmail.com> wrote:

Requiring users to download the entire Spark distribution to connect to a remote cluster (which is already running Spark) seems like overkill. Even for most Spark users who download the Spark source, it is very unintuitive that they need to run a script named "install-dev.sh" before they can run SparkR.

--Hossein

On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui <rui....@intel.com> wrote:

The SparkR package is not a standalone R package: it is the R API of Spark and needs to cooperate with a matching version of Spark, so exposing it on CRAN does not make things easier for R users, since they still need to download a matching Spark distribution -- unless we publish a bundled SparkR package to CRAN (packaged together with Spark); is that desirable? Actually, normal users who are not developers are not required to download the Spark source, build it, and install the SparkR package. They just need to download a Spark distribution and then use SparkR.

For using SparkR in RStudio, there is documentation at https://github.com/apache/spark/tree/master/R
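(For reference, the setup that README describes boils down to something like the following; the SPARK_HOME value below is a placeholder for wherever the downloaded distribution lives:)

    # Point R at a downloaded Spark distribution and load SparkR from it
    Sys.setenv(SPARK_HOME = "/path/to/spark")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    library(SparkR)
    sc <- sparkR.init(master = "local[2]")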
From: Hossein [mailto:fal...@gmail.com]
Sent: Thursday, September 24, 2015 1:42 AM
To: shiva...@eecs.berkeley.edu
Cc: Sun, Rui; dev@spark.apache.org
Subject: Re: SparkR package path

Yes, I think exposing SparkR on CRAN can significantly expand the reach of both SparkR and Spark itself to a larger community of data scientists (and statisticians).

I have been getting questions on how to use SparkR in RStudio. Most of these folks have a Spark cluster and wish to talk to it from RStudio. While that is a bigger task, for now a first step could be not requiring them to download the Spark source and run a script named install-dev.sh. I filed SPARK-10776 to track this.

--Hossein

On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:

As Rui says, it would be good to understand the use case we want to support (supporting CRAN installs could be one, for example). I don't think it should be very hard to do, as the RBackend itself doesn't use the R source files. The RRDD does use them, and the value comes from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29 AFAIK -- so we could introduce a new config flag that can be used for this new mode.

Thanks
Shivaram

On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui <rui....@intel.com> wrote:

Hossein,

Any strong reason to download and install the SparkR source package separately from the Spark distribution? An R user can simply download the Spark distribution, which contains the SparkR source and binary package, and directly use SparkR. There is no need to install the SparkR package at all.

From: Hossein [mailto:fal...@gmail.com]
Sent: Tuesday, September 22, 2015 9:19 AM
To: dev@spark.apache.org
Subject: SparkR package path

Hi dev list,

The SparkR backend assumes the SparkR source files are located under "SPARK_HOME/R/lib/". This directory is created by running R/install-dev.sh. This setting makes sense for Spark developers, but if an R user downloads and installs the SparkR source package, the source files are going to be placed in different locations.

In the R runtime it is easy to find the location of package files using path.package("SparkR"). But we need to make some changes to the R backend and/or spark-submit so that the JVM process learns the location of worker.R, daemon.R, and shell.R from the R runtime.

Do you think this change is feasible?

Thanks,
--Hossein
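(For illustration, one way the R side could discover and hand over those locations: a minimal sketch, assuming the worker scripts ship under the installed package's worker/ directory and using a hypothetical environment variable, SPARKR_PACKAGE_DIR, as the hand-off. The real change would live in the R backend and/or spark-submit as discussed above.)

    # Sketch: locate the installed SparkR files from the R runtime and expose
    # the path so the JVM side could pick it up. SPARKR_PACKAGE_DIR is a
    # hypothetical variable used only to illustrate the hand-off.
    library(SparkR)
    pkg_dir <- path.package("SparkR")                  # install location of the package
    worker  <- file.path(pkg_dir, "worker", "worker.R")
    daemon  <- file.path(pkg_dir, "worker", "daemon.R")
    Sys.setenv(SPARKR_PACKAGE_DIR = pkg_dir)           # JVM / spark-submit would read this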