Re: [VOTE] SPIP ML Pipelines in R
The vote passed with the following +1s:

- Felix
- Joseph
- Xiangrui
- Reynold

Joseph has kindly volunteered to shepherd this.

Thanks,
--Hossein

On Thu, Jun 14, 2018 at 1:32 PM Reynold Xin wrote:

> +1 on the proposal.
>
> On Fri, Jun 1, 2018 at 8:17 PM Hossein wrote:
>
>> Hi Shivaram,
>>
>> We converged on a CRAN release process that seems identical to the
>> current SparkR one.
>>
>> --Hossein
>>
>> On Thu, May 31, 2018 at 9:10 AM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> Hossein -- Can you clarify what the resolution was on the repository /
>>> release issue discussed on the SPIP?
>>>
>>> Shivaram
>>>
>>> On Thu, May 31, 2018 at 9:06 AM, Felix Cheung wrote:
>>> > +1
>>> > With my concerns in the SPIP discussion.
>>> >
>>> > From: Hossein
>>> > Sent: Wednesday, May 30, 2018 2:03:03 PM
>>> > To: dev@spark.apache.org
>>> > Subject: [VOTE] SPIP ML Pipelines in R
>>> >
>>> > Hi,
>>> >
>>> > I started a discussion thread for a new R package to expose MLlib
>>> > pipelines in R.
>>> >
>>> > To summarize, we will work on utilities to generate R wrappers for the
>>> > MLlib pipeline API for a new R package. This will lower the burden of
>>> > exposing new APIs in the future.
>>> >
>>> > Following the SPIP process, I am proposing the SPIP for a vote.
>>> >
>>> > +1: Let's go ahead and implement the SPIP.
>>> > +0: Don't really care.
>>> > -1: I do not think this is a good idea for the following reasons.
>>> >
>>> > Thanks,
>>> > --Hossein
Re: [VOTE] SPIP ML Pipelines in R
Hi Shivaram,

We converged on a CRAN release process that seems identical to the current
SparkR one.

--Hossein

On Thu, May 31, 2018 at 9:10 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Hossein -- Can you clarify what the resolution was on the repository /
> release issue discussed on the SPIP?
>
> Shivaram
>
> On Thu, May 31, 2018 at 9:06 AM, Felix Cheung wrote:
> > +1
> > With my concerns in the SPIP discussion.
> >
> > From: Hossein
> > Sent: Wednesday, May 30, 2018 2:03:03 PM
> > To: dev@spark.apache.org
> > Subject: [VOTE] SPIP ML Pipelines in R
> >
> > Hi,
> >
> > I started a discussion thread for a new R package to expose MLlib
> > pipelines in R.
> >
> > To summarize, we will work on utilities to generate R wrappers for the
> > MLlib pipeline API for a new R package. This will lower the burden of
> > exposing new APIs in the future.
> >
> > Following the SPIP process, I am proposing the SPIP for a vote.
> >
> > +1: Let's go ahead and implement the SPIP.
> > +0: Don't really care.
> > -1: I do not think this is a good idea for the following reasons.
> >
> > Thanks,
> > --Hossein
[VOTE] SPIP ML Pipelines in R
Hi,

I started a discussion thread
<http://apache-spark-developers-list.1001551.n3.nabble.com/ML-Pipelines-in-R-td24022.html#a24023>
for a new R package to expose MLlib pipelines in R
<https://issues.apache.org/jira/secure/attachment/12925281/SparkML_%20ML%20Pipelines%20in%20R-v3.pdf>.

To summarize, we will work on utilities to generate R wrappers for the
MLlib pipeline API for a new R package. This will lower the burden of
exposing new APIs in the future.

Following the SPIP process
<https://spark.apache.org/improvement-proposals.html>, I am proposing the
SPIP <https://issues.apache.org/jira/browse/SPARK-24359> for a vote.

+1: Let's go ahead and implement the SPIP.
+0: Don't really care.
-1: I do not think this is a good idea for the following reasons.

Thanks,
--Hossein
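For a sense of what the generated wrappers could look like: the SPIP models
them on the MLlib Pipeline API (stages composed into a Pipeline, then fit
and transform), but it does not pin down the exact R surface. A rough
sketch, where the package name SparkML and every function shown
(ml.tokenizer(), ml.pipeline(), ml.fit(), and so on) are illustrative
assumptions rather than a committed API:

    # Hypothetical sketch only -- none of these names are final.
    # Stages mirror org.apache.spark.ml: Tokenizer, HashingTF,
    # LogisticRegression. `training` and `test` are assumed to be
    # Spark DataFrames with a string column named "text".
    library(SparkML)

    tokenizer <- ml.tokenizer(inputCol = "text", outputCol = "words")
    hashingTF <- ml.hashingTF(inputCol = "words", outputCol = "features")
    lr        <- ml.logisticRegression(maxIter = 10, regParam = 0.01)

    # Compose the stages into a pipeline, fit it, and score new data.
    pipeline    <- ml.pipeline(stages = list(tokenizer, hashingTF, lr))
    model       <- ml.fit(pipeline, training)
    predictions <- ml.transform(model, test)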
Re: SparkR was removed from CRAN on 2018-05-01
I guess this relates to our conversation on the SPIP
<https://issues.apache.org/jira/browse/SPARK-24359>. When this happens, do
we wait for a new minor release to submit it to CRAN again?

--Hossein

On Fri, May 25, 2018 at 5:11 PM, Felix Cheung wrote:

> This is the fix:
> https://github.com/apache/spark/commit/f27a035daf705766d3445e5c6a99867c11c552b0#diff-e1e1d3d40573127e9ee0480caf1283d6
>
> I don’t have the email though.
>
> ------
> *From:* Hossein
> *Sent:* Friday, May 25, 2018 10:58:42 AM
> *To:* dev@spark.apache.org
> *Subject:* SparkR was removed from CRAN on 2018-05-01
>
> Would you please forward the email from CRAN? Is there a JIRA?
>
> Thanks,
> --Hossein
SparkR was removed from CRAN on 2018-05-01
Would you please forward the email from CRAN? Is there a JIRA?

Thanks,
--Hossein
Re: ML Pipelines in R
Correction: the SPIP is https://issues.apache.org/jira/browse/SPARK-24359

--Hossein

On Tue, May 22, 2018 at 6:23 PM, Hossein <fal...@gmail.com> wrote:

> Hi all,
>
> SparkR supports calling MLlib functionality with an R-friendly API. Since
> Spark 1.5, the (new) SparkML API, which is based on pipelines and
> parameters, has matured significantly. It allows users to build and
> maintain complicated machine learning pipelines. A lot of this
> functionality is difficult to expose using the simple formula-based API
> in SparkR.
>
> I just submitted a SPIP
> <https://issues.apache.org/jira/browse/SPARK-21190> to propose a new R
> package, SparkML, to be distributed along with SparkR as part of Apache
> Spark. Please view the JIRA ticket and provide feedback and comments.
>
> Thanks,
> --Hossein
ML Pipelines in R
Hi all,

SparkR supports calling MLlib functionality with an R-friendly API. Since
Spark 1.5, the (new) SparkML API, which is based on pipelines and
parameters, has matured significantly. It allows users to build and
maintain complicated machine learning pipelines. A lot of this
functionality is difficult to expose using the simple formula-based API in
SparkR.

I just submitted a SPIP <https://issues.apache.org/jira/browse/SPARK-21190>
to propose a new R package, SparkML, to be distributed along with SparkR as
part of Apache Spark. Please view the JIRA ticket and provide feedback and
comments.

Thanks,
--Hossein
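For contrast, the formula-based API mentioned above is a single call per
model -- for example the DataFrame glm() method that SparkR gained in
Spark 1.5. A sketch, assuming a 1.5-era session where sqlContext comes
from sparkRSQL.init(sc):

    # One call fits one model; there is no way to chain feature
    # transformers into a multi-stage pipeline from R.
    df <- createDataFrame(sqlContext, iris)  # "." in names becomes "_"
    model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df,
                 family = "gaussian")
    summary(model)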
Re: SparkR dataframe UDF
User-defined functions written in R are not supported yet. You can
implement your UDF in Scala, register it in the sqlContext, and use it from
SparkR, provided that you share your context between R and Scala.

--Hossein

On Friday, October 2, 2015, Renyi Xiong <renyixio...@gmail.com> wrote:

> Hi Shiva,
>
> Is the DataFrame UDF implemented in SparkR yet? I could not find it at
> the URL below:
>
> https://github.com/hlin09/spark/tree/SparkR-streaming/R/pkg/R
>
> Thanks,
> Renyi.

--
--Hossein
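A minimal sketch of that workaround, under the stated assumption that the
R and Scala sides share one SQLContext; the UDF name toUpper and the
sample data are made up for illustration:

    # Scala side, registered beforehand in the shared SQLContext:
    #   sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)
    #
    # R side (SparkR 1.5-era API): the registered UDF is reachable via SQL.
    df <- createDataFrame(sqlContext, data.frame(name = c("alice", "bob")))
    registerTempTable(df, "people")
    upper <- sql(sqlContext, "SELECT toUpper(name) AS name FROM people")
    head(upper)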
Re: SparkR package path
Requiring users to download the entire Spark distribution to connect to a
remote cluster (which is already running Spark) seems like overkill. Even
for most Spark users who download the Spark source, it is very unintuitive
that they need to run a script named "install-dev.sh" before they can run
SparkR.

--Hossein

On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui <rui@intel.com> wrote:

> SparkR is not a standalone R package: it is the R API of Spark and needs
> to co-operate with a matching version of Spark, so exposing it on CRAN
> does not ease use for R users, as they would still need to download a
> matching Spark distribution -- unless we publish a bundled SparkR package
> to CRAN (packaging it with Spark). Is this desirable? Actually, normal
> users who are not developers are not required to download the Spark
> source, build, and install the SparkR package. They just need to download
> a Spark distribution and then use SparkR.
>
> For using SparkR in RStudio, there is documentation at
> https://github.com/apache/spark/tree/master/R
>
> *From:* Hossein [mailto:fal...@gmail.com]
> *Sent:* Thursday, September 24, 2015 1:42 AM
> *To:* shiva...@eecs.berkeley.edu
> *Cc:* Sun, Rui; dev@spark.apache.org
> *Subject:* Re: SparkR package path
>
> Yes, I think exposing SparkR on CRAN can significantly expand the reach
> of both SparkR and Spark itself to a larger community of data scientists
> (and statisticians).
>
> I have been getting questions on how to use SparkR in RStudio. Most of
> these folks have a Spark cluster and wish to talk to it from RStudio.
> While that is a bigger task, for now, a first step could be not requiring
> them to download the Spark source and run a script named install-dev.sh.
> I filed SPARK-10776 to track this.
>
> --Hossein
>
> On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
> As Rui says, it would be good to understand the use case we want to
> support (supporting CRAN installs could be one, for example). I don't
> think it should be very hard to do, as the RBackend itself doesn't use
> the R source files. The RRDD does use them, and the value comes from
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
> AFAIK -- so we could introduce a new config flag that can be used for
> this new mode.
>
> Thanks
> Shivaram
>
> On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui <rui@intel.com> wrote:
> > Hossein,
> >
> > Any strong reason to download and install the SparkR source package
> > separately from the Spark distribution?
> >
> > An R user can simply download the Spark distribution, which contains
> > the SparkR source and binary package, and directly use SparkR. No need
> > to install the SparkR package at all.
> >
> > From: Hossein [mailto:fal...@gmail.com]
> > Sent: Tuesday, September 22, 2015 9:19 AM
> > To: dev@spark.apache.org
> > Subject: SparkR package path
> >
> > Hi dev list,
> >
> > The SparkR backend assumes SparkR source files are located under
> > "SPARK_HOME/R/lib/". This directory is created by running
> > R/install-dev.sh. This setting makes sense for Spark developers, but if
> > an R user downloads and installs the SparkR source package, the source
> > files are going to be placed in different locations.
> >
> > In the R runtime it is easy to find the location of package files using
> > path.package("SparkR"). But we need to make some changes to the R
> > backend and/or spark-submit so that the JVM process learns the location
> > of worker.R, daemon.R, and shell.R from the R runtime.
> >
> > Do you think this change is feasible?
> >
> > Thanks,
> > --Hossein
Re: SparkR package path
Right now in sparkR.R the backend hostname is hard-coded to "localhost"
(https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L156). If we
make that address configurable / parameterized, then a user can connect to
a remote Spark cluster with no need to have the Spark jars on their local
machine. I have gotten this request from some R users. Their company has a
Spark cluster (usually managed by another team), and they want to connect
to it from their workstations (e.g., from within RStudio, etc.).

--Hossein

On Thu, Sep 24, 2015 at 12:25 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> I don't think the crux of the problem is about users who download the
> source -- Spark's source distribution is clearly marked as something
> that needs to be built, and they can run `mvn -DskipTests -Psparkr
> package` based on instructions in the Spark docs.
>
> The crux of the problem is that with a source or binary R package, the
> client-side SparkR code needs the Spark JARs to be available. So we
> can't just connect to a remote Spark cluster using just the R scripts,
> as we need the Scala classes around to create a Spark context etc.
>
> But this is a use case that I've heard from a lot of users -- my take
> is that this should be a separate package / layer on top of SparkR.
> Dan Putler (cc'd) had a proposal for a client package for this and may
> be able to add more.
>
> Thanks
> Shivaram
>
> On Thu, Sep 24, 2015 at 11:36 AM, Hossein <fal...@gmail.com> wrote:
> > Requiring users to download the entire Spark distribution to connect
> > to a remote cluster (which is already running Spark) seems like
> > overkill. Even for most Spark users who download the Spark source, it
> > is very unintuitive that they need to run a script named
> > "install-dev.sh" before they can run SparkR.
> >
> > --Hossein
> >
> > On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui <rui@intel.com> wrote:
> >>
> >> SparkR is not a standalone R package: it is the R API of Spark and
> >> needs to co-operate with a matching version of Spark, so exposing it
> >> on CRAN does not ease use for R users, as they would still need to
> >> download a matching Spark distribution -- unless we publish a bundled
> >> SparkR package to CRAN (packaging it with Spark). Is this desirable?
> >> Actually, normal users who are not developers are not required to
> >> download the Spark source, build, and install the SparkR package.
> >> They just need to download a Spark distribution and then use SparkR.
> >>
> >> For using SparkR in RStudio, there is documentation at
> >> https://github.com/apache/spark/tree/master/R
> >>
> >> From: Hossein [mailto:fal...@gmail.com]
> >> Sent: Thursday, September 24, 2015 1:42 AM
> >> To: shiva...@eecs.berkeley.edu
> >> Cc: Sun, Rui; dev@spark.apache.org
> >> Subject: Re: SparkR package path
> >>
> >> Yes, I think exposing SparkR on CRAN can significantly expand the
> >> reach of both SparkR and Spark itself to a larger community of data
> >> scientists (and statisticians).
> >>
> >> I have been getting questions on how to use SparkR in RStudio. Most
> >> of these folks have a Spark cluster and wish to talk to it from
> >> RStudio. While that is a bigger task, for now, a first step could be
> >> not requiring them to download the Spark source and run a script
> >> named install-dev.sh. I filed SPARK-10776 to track this.
> >>
> >> --Hossein
> >>
> >> On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman
> >> <shiva...@eecs.berkeley.edu> wrote:
> >>
> >> As Rui says, it would be good to understand the use case we want to
> >> support (supporting CRAN installs could be one, for example). I don't
> >> think it should be very hard to do, as the RBackend itself doesn't
> >> use the R source files. The RRDD does use them, and the value comes
> >> from
> >> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
> >> AFAIK -- so we could introduce a new config flag that can be used for
> >> this new mode.
> >>
> >> Thanks
> >> Shivaram
> >>
> >> On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui <rui@intel.com> wrote:
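A sketch of the parameterization being requested, assuming the hard-coded
"localhost" were replaced by an environment lookup. SPARKR_BACKEND_HOST is
a hypothetical variable name, not an existing Spark setting;
connectBackend() is SparkR's internal client function, and
EXISTING_SPARKR_BACKEND_PORT is the variable sparkR.init() already
consults for the port:

    # Hypothetical: let the backend address come from the environment
    # instead of always assuming the JVM backend runs on this machine.
    backendHost <- Sys.getenv("SPARKR_BACKEND_HOST", "localhost")
    backendPort <- as.integer(Sys.getenv("EXISTING_SPARKR_BACKEND_PORT", "0"))

    # connectBackend() is not exported; shown only to sketch the idea.
    conn <- SparkR:::connectBackend(backendHost, backendPort)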
Re: [VOTE] Release Apache Spark 1.5.1 (RC1)
+1. Tested SparkR on Mac and Linux.

--Hossein

On Thu, Sep 24, 2015 at 3:10 PM, Xiangrui Meng <men...@gmail.com> wrote:

> +1. Checked the user guide and API docs, and ran some MLlib and SparkR
> examples. -Xiangrui
>
> On Thu, Sep 24, 2015 at 2:54 PM, Reynold Xin <r...@databricks.com> wrote:
> > I'm going to +1 this myself. Tested on my laptop.
> >
> > On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin <r...@databricks.com>
> > wrote:
> >>
> >> I forked a new thread for this. Please discuss NOTICE-file-related
> >> things there so it doesn't hijack this thread.
> >>
> >> On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen <so...@cloudera.com>
> >> wrote:
> >>>
> >>> On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas
> >>> <rhil...@us.ibm.com> wrote:
> >>> > Under your guidance, I would be happy to help compile a NOTICE file
> >>> > which follows the pattern used by Derby and the JDK. This effort
> >>> > might proceed in parallel with vetting 1.5.1 and could be targeted
> >>> > at a later release vehicle. I don't think that the ASF's exposure
> >>> > is greatly increased by one more release which follows the old
> >>> > pattern.
> >>>
> >>> I'd prefer to use the ASF's preferred pattern, no? That's what we've
> >>> been trying to do, and it seems we're even required to do so, not
> >>> follow a different convention. There is some specific guidance there
> >>> about what to add, and not add, to these files. Specifically, because
> >>> the AL2 requires downstream projects to embed the contents of NOTICE,
> >>> the guidance is to only include elements in NOTICE that must appear
> >>> there.
> >>>
> >>> Put it this way -- what would you like to change specifically? (You
> >>> can start another thread for that.)
> >>>
> >>> >> My assessment (just looked before I saw Sean's email) is the same
> >>> >> as his. The NOTICE file embeds other projects' licenses.
> >>> >
> >>> > This may be where our perspectives diverge. I did not find those
> >>> > licenses embedded in the NOTICE file. As I see it, the licenses are
> >>> > cited but not included.
> >>>
> >>> Pretty sure that was meant to say that NOTICE embeds other projects'
> >>> "notices", not licenses. And those notices can have all kinds of
> >>> stuff, including licenses.
Re: SparkR package path
Yes, I think exposing SparkR on CRAN can significantly expand the reach of
both SparkR and Spark itself to a larger community of data scientists (and
statisticians).

I have been getting questions on how to use SparkR in RStudio. Most of
these folks have a Spark cluster and wish to talk to it from RStudio. While
that is a bigger task, for now, a first step could be not requiring them to
download the Spark source and run a script named install-dev.sh. I filed
SPARK-10776 to track this.

--Hossein

On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> As Rui says, it would be good to understand the use case we want to
> support (supporting CRAN installs could be one, for example). I don't
> think it should be very hard to do, as the RBackend itself doesn't use
> the R source files. The RRDD does use them, and the value comes from
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
> AFAIK -- so we could introduce a new config flag that can be used for
> this new mode.
>
> Thanks
> Shivaram
>
> On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui <rui@intel.com> wrote:
> > Hossein,
> >
> > Any strong reason to download and install the SparkR source package
> > separately from the Spark distribution?
> >
> > An R user can simply download the Spark distribution, which contains
> > the SparkR source and binary package, and directly use SparkR. No need
> > to install the SparkR package at all.
> >
> > From: Hossein [mailto:fal...@gmail.com]
> > Sent: Tuesday, September 22, 2015 9:19 AM
> > To: dev@spark.apache.org
> > Subject: SparkR package path
> >
> > Hi dev list,
> >
> > The SparkR backend assumes SparkR source files are located under
> > "SPARK_HOME/R/lib/". This directory is created by running
> > R/install-dev.sh. This setting makes sense for Spark developers, but if
> > an R user downloads and installs the SparkR source package, the source
> > files are going to be placed in different locations.
> >
> > In the R runtime it is easy to find the location of package files using
> > path.package("SparkR"). But we need to make some changes to the R
> > backend and/or spark-submit so that the JVM process learns the location
> > of worker.R, daemon.R, and shell.R from the R runtime.
> >
> > Do you think this change is feasible?
> >
> > Thanks,
> > --Hossein
SparkR package path
Hi dev list,

The SparkR backend assumes SparkR source files are located under
"SPARK_HOME/R/lib/". This directory is created by running R/install-dev.sh.
This setting makes sense for Spark developers, but if an R user downloads
and installs the SparkR source package, the source files are going to be
placed in different locations.

In the R runtime it is easy to find the location of package files using
path.package("SparkR"). But we need to make some changes to the R backend
and/or spark-submit so that the JVM process learns the location of
worker.R, daemon.R, and shell.R from the R runtime.

Do you think this change is feasible?

Thanks,
--Hossein
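A short illustration of that point: base R can already resolve both the
installed package root and any file shipped inside a package, wherever it
is installed. This assumes SparkR is installed as a regular R package
(worker.R and daemon.R live under inst/worker/ in the Spark source tree,
so they land at the package root on install):

    # Base-R lookups that work wherever the package is installed,
    # with no dependence on SPARK_HOME/R/lib:
    library(SparkR)
    path.package("SparkR")  # root of the installed SparkR package

    # Resolve files shipped inside the package, e.g. the worker scripts.
    system.file("worker", "worker.R", package = "SparkR")
    system.file("worker", "daemon.R", package = "SparkR")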