Right now in sparkR.R the backend hostname is hard-coded to "localhost" (https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L156).
If we make that address configurable / parameterized, then a user can connect to a remote Spark cluster with no need to have the Spark JARs on their local machine. I have gotten this request from some R users. Their company has a Spark cluster (usually managed by another team), and they want to connect to it from their workstations (e.g., from within RStudio, etc.).

--Hossein
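(As a rough illustration, making the host configurable could be as simple as reading it from the environment before opening the backend socket. The variable name SPARKR_BACKEND_HOST and the function shape below are assumptions for the sketch, not the actual SparkR API:)

    # Sketch only: read the backend host from an (assumed) environment variable
    # instead of hard-coding "localhost". Falls back to localhost when unset.
    connect_backend <- function(port, timeout = 6000) {
      host <- Sys.getenv("SPARKR_BACKEND_HOST", unset = "localhost")
      socketConnection(host = host, port = port, server = FALSE,
                       blocking = TRUE, open = "wb", timeout = timeout)
    }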
On Thu, Sep 24, 2015 at 12:25 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:

I don't think the crux of the problem is about users who download the source -- Spark's source distribution is clearly marked as something that needs to be built, and they can run `mvn -DskipTests -Psparkr package` based on instructions in the Spark docs.

The crux of the problem is that, with either a source or a binary R package, the SparkR code on the client side needs the Spark JARs to be available. So we can't just connect to a remote Spark cluster using only the R scripts, as we need the Scala classes around to create a Spark context etc.

But this is a use case that I've heard from a lot of users -- my take is that this should be a separate package / layer on top of SparkR. Dan Putler (cc'd) had a proposal for a client package for this and may be able to add more.

Thanks
Shivaram

On Thu, Sep 24, 2015 at 11:36 AM, Hossein <fal...@gmail.com> wrote:

Requiring users to download the entire Spark distribution to connect to a remote cluster (which is already running Spark) seems like overkill. Even for most Spark users who download the Spark source, it is very unintuitive that they need to run a script named "install-dev.sh" before they can run SparkR.

--Hossein

On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui <rui....@intel.com> wrote:

The SparkR package is not a standalone R package: it is the R API of Spark and needs to cooperate with a matching version of Spark, so exposing it on CRAN does not make things easier for R users, since they still need to download a matching Spark distribution -- unless we publish a bundled SparkR package to CRAN (packaged together with Spark); is that desirable? Actually, normal users who are not developers are not required to download the Spark source, build it, and install the SparkR package. They just need to download a Spark distribution and then use SparkR.

For using SparkR in RStudio, there is documentation at https://github.com/apache/spark/tree/master/R
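(For reference, the setup that README describes boils down to something like the following; the SPARK_HOME value below is a placeholder for wherever the downloaded distribution lives:)

    # Point R at a downloaded Spark distribution and load SparkR from it
    Sys.setenv(SPARK_HOME = "/path/to/spark")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    library(SparkR)
    sc <- sparkR.init(master = "local[2]")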
From: Hossein [mailto:fal...@gmail.com]
Sent: Thursday, September 24, 2015 1:42 AM
To: shiva...@eecs.berkeley.edu
Cc: Sun, Rui; dev@spark.apache.org
Subject: Re: SparkR package path

Yes, I think exposing SparkR on CRAN can significantly expand the reach of both SparkR and Spark itself to a larger community of data scientists (and statisticians).

I have been getting questions on how to use SparkR in RStudio. Most of these folks have a Spark cluster and wish to talk to it from RStudio. While that is a bigger task, for now a first step could be not requiring them to download the Spark source and run a script named install-dev.sh. I filed SPARK-10776 to track this.

--Hossein

On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:

As Rui says, it would be good to understand the use case we want to support (supporting CRAN installs could be one, for example). I don't think it should be very hard to do, as the RBackend itself doesn't use the R source files. The RRDD does use them, and the value comes from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29 AFAIK -- so we could introduce a new config flag that can be used for this new mode.

Thanks
Shivaram

On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui <rui....@intel.com> wrote:

Hossein,

Any strong reason to download and install the SparkR source package separately from the Spark distribution? An R user can simply download the Spark distribution, which contains the SparkR source and binary package, and directly use SparkR. There is no need to install the SparkR package at all.

From: Hossein [mailto:fal...@gmail.com]
Sent: Tuesday, September 22, 2015 9:19 AM
To: dev@spark.apache.org
Subject: SparkR package path

Hi dev list,

The SparkR backend assumes the SparkR source files are located under "SPARK_HOME/R/lib/". This directory is created by running R/install-dev.sh. This setting makes sense for Spark developers, but if an R user downloads and installs the SparkR source package, the source files are going to be placed in different locations.

In the R runtime it is easy to find the location of package files using path.package("SparkR"). But we need to make some changes to the R backend and/or spark-submit so that the JVM process learns the location of worker.R, daemon.R, and shell.R from the R runtime.

Do you think this change is feasible?

Thanks,
--Hossein
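(For illustration, one way the R side could discover and hand over those locations: a minimal sketch, assuming the worker scripts ship under the installed package's worker/ directory and using a hypothetical environment variable, SPARKR_PACKAGE_DIR, as the hand-off. The real change would live in the R backend and/or spark-submit as discussed above.)

    # Sketch: locate the installed SparkR files from the R runtime and expose
    # the path so the JVM side could pick it up. SPARKR_PACKAGE_DIR is a
    # hypothetical variable used only to illustrate the hand-off.
    library(SparkR)
    pkg_dir <- path.package("SparkR")                  # install location of the package
    worker  <- file.path(pkg_dir, "worker", "worker.R")
    daemon  <- file.path(pkg_dir, "worker", "daemon.R")
    Sys.setenv(SPARKR_PACKAGE_DIR = pkg_dir)           # JVM / spark-submit would read this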