Re: [VOTE] SPIP ML Pipelines in R

2018-06-14 Thread Hossein
The vote passed with the following +1 votes:

- Felix
- Joseph
- Xiangrui
- Reynold

Joseph has kindly volunteered to shepherd this.

Thanks,
--Hossein


On Thu, Jun 14, 2018 at 1:32 PM Reynold Xin  wrote:

> +1 on the proposal.
>
>
> On Fri, Jun 1, 2018 at 8:17 PM Hossein  wrote:
>
>> Hi Shivaram,
>>
>> We converged on a CRAN release process that seems identical to the current
>> SparkR process.
>>
>> --Hossein
>>
>> On Thu, May 31, 2018 at 9:10 AM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> Hossein -- Can you clarify what the resolution was on the repository /
>>> release issue discussed on the SPIP?
>>>
>>> Shivaram
>>>
>>> On Thu, May 31, 2018 at 9:06 AM, Felix Cheung 
>>> wrote:
>>> > +1
>>> > With my concerns in the SPIP discussion.
>>> >
>>> > 
>>> > From: Hossein 
>>> > Sent: Wednesday, May 30, 2018 2:03:03 PM
>>> > To: dev@spark.apache.org
>>> > Subject: [VOTE] SPIP ML Pipelines in R
>>> >
>>> > Hi,
>>> >
>>> > I started a discussion thread for a new R package to expose MLlib
>>> > pipelines in R.
>>> >
>>> > To summarize, we will work on utilities to generate R wrappers for the
>>> > MLlib pipeline API in a new R package. This will lower the burden of
>>> > exposing new APIs in the future.
>>> >
>>> > Following the SPIP process, I am proposing the SPIP for a vote.
>>> >
>>> > +1: Let's go ahead and implement the SPIP.
>>> > +0: Don't really care.
>>> > -1: I do not think this is a good idea for the following reasons.
>>> >
>>> > Thanks,
>>> > --Hossein
>>>
>>
>>


Re: [VOTE] SPIP ML Pipelines in R

2018-06-01 Thread Hossein
Hi Shivaram,

We converged on a CRAN release process that seems identical to the current
SparkR process.

--Hossein

On Thu, May 31, 2018 at 9:10 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Hossein -- Can you clarify what the resolution was on the repository /
> release issue discussed on the SPIP?
>
> Shivaram
>
> On Thu, May 31, 2018 at 9:06 AM, Felix Cheung 
> wrote:
> > +1
> > With my concerns in the SPIP discussion.
> >
> > 
> > From: Hossein 
> > Sent: Wednesday, May 30, 2018 2:03:03 PM
> > To: dev@spark.apache.org
> > Subject: [VOTE] SPIP ML Pipelines in R
> >
> > Hi,
> >
> > I started a discussion thread for a new R package to expose MLlib
> > pipelines in R.
> >
> > To summarize, we will work on utilities to generate R wrappers for the
> > MLlib pipeline API in a new R package. This will lower the burden of
> > exposing new APIs in the future.
> >
> > Following the SPIP process, I am proposing the SPIP for a vote.
> >
> > +1: Let's go ahead and implement the SPIP.
> > +0: Don't really care.
> > -1: I do not think this is a good idea for the following reasons.
> >
> > Thanks,
> > --Hossein
>


[VOTE] SPIP ML Pipelines in R

2018-05-30 Thread Hossein
Hi,

I started a discussion thread
<http://apache-spark-developers-list.1001551.n3.nabble.com/ML-Pipelines-in-R-td24022.html#a24023>
for a new R package to expose MLlib pipelines in R
<https://issues.apache.org/jira/secure/attachment/12925281/SparkML_%20ML%20Pipelines%20in%20R-v3.pdf>
.

To summarize, we will work on utilities to generate R wrappers for the
MLlib pipeline API in a new R package. This will lower the burden of
exposing new APIs in the future.
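
As a rough illustration of the kind of interface such generated wrappers
could expose (all constructor and argument names below are hypothetical, not
a committed design), an R user might compose and fit a pipeline along these
lines:

    # Hypothetical sketch only: the generator would emit R constructors that
    # mirror the Scala Pipeline API; none of these names are final.
    tokenizer <- ml.Tokenizer(inputCol = "text", outputCol = "words")
    hashingTF <- ml.HashingTF(inputCol = "words", outputCol = "features")
    lr        <- ml.LogisticRegression(maxIter = 10, regParam = 0.01)

    pipeline  <- ml.Pipeline(stages = list(tokenizer, hashingTF, lr))
    model     <- ml.fit(pipeline, trainingDF)    # trainingDF: a SparkDataFrame
    scored    <- ml.transform(model, testDF)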

Following the SPIP process
<https://spark.apache.org/improvement-proposals.html>, I am proposing the
SPIP <https://issues.apache.org/jira/browse/SPARK-24359> for a vote.

+1: Let's go ahead and implement the SPIP.
+0: Don't really care.
-1: I do not think this is a good idea for the following reasons.

Thanks,
--Hossein


Re: SparkR was removed from CRAN on 2018-05-01

2018-05-29 Thread Hossein
I guess this relates to our conversation on the SPIP
<https://issues.apache.org/jira/browse/SPARK-24359>. When this happens, do
we wait for a new minor release to submit it to CRAN again?

--Hossein

On Fri, May 25, 2018 at 5:11 PM, Felix Cheung 
wrote:

> This is the fix
> https://github.com/apache/spark/commit/f27a035daf705766d3445e5c6a99867c11c552b0#diff-e1e1d3d40573127e9ee0480caf1283d6
>
> I don’t have the email though.
>
> ------
> *From:* Hossein 
> *Sent:* Friday, May 25, 2018 10:58:42 AM
> *To:* dev@spark.apache.org
> *Subject:* SparkR was removed from CRAN on 2018-05-01
>
> Would you please forward the email from CRAN? Is there a JIRA?
>
> Thanks,
> --Hossein
>


SparkR was removed from CRAN on 2018-05-01

2018-05-25 Thread Hossein
Would you please forward the email from CRAN? Is there a JIRA?

Thanks,
--Hossein


Re: ML Pipelines in R

2018-05-22 Thread Hossein
Correction: the SPIP is https://issues.apache.org/jira/browse/SPARK-24359


--Hossein

On Tue, May 22, 2018 at 6:23 PM, Hossein <fal...@gmail.com> wrote:

> Hi all,
>
> SparkR supports calling MLlib functionality with an R-friendly API. Since
> Spark 1.5, the (new) SparkML API, which is based on pipelines and parameters,
> has matured significantly. It allows users to build and maintain complicated
> machine learning pipelines. A lot of this functionality is difficult to
> expose using the simple formula-based API in SparkR.
>
> I just submitted a SPIP
> <https://issues.apache.org/jira/browse/SPARK-21190> to propose a new R
> package, SparkML, to be distributed along with SparkR as part of Apache
> Spark. Please view the JIRA ticket and provide feedback & comments.
>
> Thanks,
> --Hossein
>


ML Pipelines in R

2018-05-22 Thread Hossein
Hi all,

SparkR supports calling MLlib functionality with an R-friendly API. Since
Spark 1.5, the (new) SparkML API, which is based on pipelines and parameters,
has matured significantly. It allows users to build and maintain complicated
machine learning pipelines. A lot of this functionality is difficult to
expose using the simple formula-based API in SparkR.
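
For context, here is a minimal sketch of the existing formula-based interface
(assuming a Spark 2.x-era SparkR session); this is the style of API that
becomes hard to stretch to full pipelines:

    # Sketch of the current formula-based API in SparkR (Spark 2.x era).
    library(SparkR)
    sparkR.session()
    df <- as.DataFrame(iris)   # SparkR replaces '.' in column names with '_'
    model <- spark.glm(df, Sepal_Length ~ Sepal_Width + Species, family = "gaussian")
    head(predict(model, df))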

I just submitted a SPIP <https://issues.apache.org/jira/browse/SPARK-21190>
to propose a new R package, SparkML, to be distributed along with SparkR as
part of Apache Spark. Please view the JIRA ticket and provide feedback &
comments.

Thanks,
--Hossein


Re: SparkR dataframe UDF

2015-10-06 Thread Hossein
User-defined functions written in R are not supported yet. You can implement
your UDF in Scala, register it in the sqlContext, and use it from SparkR,
provided that you share the context between R and Scala.
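
For illustration, a minimal sketch of the R side, assuming a UDF (here called
"toUpper", a name chosen only for this example) has already been registered
on the shared sqlContext from Scala:

    # Sketch (Spark 1.5-era SparkR): call a Scala-registered UDF through SQL.
    # Assumes the Scala side has already run something like:
    #   sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)
    registerTempTable(df, "people")   # df: an existing SparkR DataFrame with a "name" column
    upper <- sql(sqlContext, "SELECT toUpper(name) AS name_upper FROM people")
    head(upper)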

--Hossein

On Friday, October 2, 2015, Renyi Xiong <renyixio...@gmail.com> wrote:

> Hi Shiva,
>
> Is Dataframe UDF implemented in SparkR yet? - I could not find it in below
> URL
>
> https://github.com/hlin09/spark/tree/SparkR-streaming/R/pkg/R
>
> Thanks,
> Renyi.
>


-- 
--Hossein


Re: SparkR package path

2015-09-24 Thread Hossein
Requiring users to download the entire Spark distribution to connect to a
remote cluster (which is already running Spark) seems like overkill. Even
for most Spark users who download the Spark source, it is very unintuitive
that they need to run a script named "install-dev.sh" before they can run
SparkR.

--Hossein

On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui <rui@intel.com> wrote:

> The SparkR package is not a standalone R package; it is actually the R API of
> Spark and needs to co-operate with a matching version of Spark, so exposing
> it on CRAN does not make things easier for R users, as they would still need
> to download a matching Spark distribution, unless we publish a bundled SparkR
> package (packaged with Spark) to CRAN. Is this desirable? Actually, normal
> users who are not developers are not required to download the Spark source,
> build, and install the SparkR package. They just need to download a Spark
> distribution and then use SparkR.
>
>
>
> For using SparkR in Rstudio, there is a documentation at
> https://github.com/apache/spark/tree/master/R
>
>
>
>
>
>
>
> *From:* Hossein [mailto:fal...@gmail.com]
> *Sent:* Thursday, September 24, 2015 1:42 AM
> *To:* shiva...@eecs.berkeley.edu
> *Cc:* Sun, Rui; dev@spark.apache.org
> *Subject:* Re: SparkR package path
>
>
>
> Yes, I think exposing SparkR in CRAN can significantly expand the reach of
> both SparkR and Spark itself to a larger community of data scientists (and
> statisticians).
>
>
>
> I have been getting questions on how to use SparkR in RStudio. Most of
> these folks have a Spark Cluster and wish to talk to it from RStudio. While
> that is a bigger task, for now, first step could be not requiring them to
> download Spark source and run a script that is named install-dev.sh. I
> filed SPARK-10776 to track this.
>
>
>
>
> --Hossein
>
>
>
> On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
> As Rui says it would be good to understand the use case we want to
> support (supporting CRAN installs could be one for example). I don't
> think it should be very hard to do as the RBackend itself doesn't use
> the R source files. The RRDD does use it and the value comes from
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
> AFAIK -- So we could introduce a new config flag that can be used for
> this new mode.
>
> Thanks
> Shivaram
>
>
> On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui <rui@intel.com> wrote:
> > Hossein,
> >
> >
> >
> > Any strong reason to download and install SparkR source package
> separately
> > from the Spark distribution?
> >
> > An R user can simply download the spark distribution, which contains
> SparkR
> > source and binary package, and directly use sparkR. No need to install
> > SparkR package at all.
> >
> >
> >
> > From: Hossein [mailto:fal...@gmail.com]
> > Sent: Tuesday, September 22, 2015 9:19 AM
> > To: dev@spark.apache.org
> > Subject: SparkR package path
> >
> >
> >
> > Hi dev list,
> >
> >
> >
> > The SparkR backend assumes SparkR source files are located under
> > "SPARK_HOME/R/lib/." This directory is created by running
> > R/install-dev.sh. This setting makes sense for Spark developers, but if an
> > R user downloads and installs the SparkR source package, the source files
> > are going to be placed in different locations.
> >
> >
> >
> > In the R runtime it is easy to find the location of package files using
> > path.package("SparkR"). But we need to make some changes to the R backend
> > and/or spark-submit so that the JVM process learns the location of worker.R,
> > daemon.R, and shell.R from the R runtime.
> >
> >
> >
> > Do you think this change is feasible?
> >
> >
> >
> > Thanks,
> >
> > --Hossein
>
>
>


Re: SparkR package path

2015-09-24 Thread Hossein
Right now in sparkR.R the backend hostname is hard-coded to "localhost" (
https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L156).

If we make that address configurable / parameterized, then a user can
connect to a remote Spark cluster with no need to have Spark jars on their
local machine. I have received this request from some R users. Their company
has a Spark cluster (usually managed by another team), and they want to
connect to it from their workstation (e.g., from within RStudio, etc.).
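
A rough sketch of the kind of change being suggested (the environment
variable names and the exact plumbing below are hypothetical; today sparkR.R
simply calls its internal connectBackend helper with "localhost"):

    # Hypothetical sketch, not the current SparkR code: let the backend host
    # come from the environment (or a function argument) instead of being
    # hard-coded to "localhost".
    backendHost <- Sys.getenv("SPARKR_BACKEND_HOST", "localhost")      # assumed new variable
    backendPort <- as.integer(Sys.getenv("SPARKR_BACKEND_PORT", "0"))  # illustrative name
    connection  <- connectBackend(backendHost, backendPort)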



--Hossein

On Thu, Sep 24, 2015 at 12:25 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> I don't think the crux of the problem is about users who download the
> source -- Spark's source distribution is clearly marked as something
> that needs to be built and they can run `mvn -DskipTests -Psparkr
> package` based on instructions in the Spark docs.
>
> The crux of the problem is that with a source or binary R package, the
> client-side SparkR code needs the Spark JARs to be available. So
> we can't just connect to a remote Spark cluster using just the R
> scripts as we need the Scala classes around to create a Spark context
> etc.
>
> But this is a use case that I've heard from a lot of users -- my take
> is that this should be a separate package / layer on top of SparkR.
> Dan Putler (cc'd) had a proposal on a client package for this and may
> be able to add more.
>
> Thanks
> Shivaram
>
> On Thu, Sep 24, 2015 at 11:36 AM, Hossein <fal...@gmail.com> wrote:
> > Requiring users to download the entire Spark distribution to connect to a
> > remote cluster (which is already running Spark) seems like overkill. Even
> > for most Spark users who download the Spark source, it is very unintuitive
> > that they need to run a script named "install-dev.sh" before they can run
> > SparkR.
> >
> > --Hossein
> >
> > On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui <rui@intel.com> wrote:
> >>
> >> The SparkR package is not a standalone R package; it is actually the R API
> >> of Spark and needs to co-operate with a matching version of Spark, so
> >> exposing it on CRAN does not make things easier for R users, as they would
> >> still need to download a matching Spark distribution, unless we publish a
> >> bundled SparkR package (packaged with Spark) to CRAN. Is this desirable?
> >> Actually, normal users who are not developers are not required to download
> >> the Spark source, build, and install the SparkR package. They just need to
> >> download a Spark distribution and then use SparkR.
> >>
> >>
> >>
> >> For using SparkR in Rstudio, there is a documentation at
> >> https://github.com/apache/spark/tree/master/R
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> From: Hossein [mailto:fal...@gmail.com]
> >> Sent: Thursday, September 24, 2015 1:42 AM
> >> To: shiva...@eecs.berkeley.edu
> >> Cc: Sun, Rui; dev@spark.apache.org
> >> Subject: Re: SparkR package path
> >>
> >>
> >>
> >> Yes, I think exposing SparkR in CRAN can significantly expand the reach
> of
> >> both SparkR and Spark itself to a larger community of data scientists
> (and
> >> statisticians).
> >>
> >>
> >>
> >> I have been getting questions on how to use SparkR in RStudio. Most of
> >> these folks have a Spark Cluster and wish to talk to it from RStudio.
> While
> >> that is a bigger task, for now, first step could be not requiring them
> to
> >> download Spark source and run a script that is named install-dev.sh. I
> filed
> >> SPARK-10776 to track this.
> >>
> >>
> >>
> >>
> >> --Hossein
> >>
> >>
> >>
> >> On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman
> >> <shiva...@eecs.berkeley.edu> wrote:
> >>
> >> As Rui says it would be good to understand the use case we want to
> >> support (supporting CRAN installs could be one for example). I don't
> >> think it should be very hard to do as the RBackend itself doesn't use
> >> the R source files. The RRDD does use it and the value comes from
> >>
> >>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
> >> AFAIK -- So we could introduce a new config flag that can be used for
> >> this new mode.
> >>
> >> Thanks
> >> Shivaram
> >>
> >>
> >> On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui <rui@intel.

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Hossein
+1. Tested SparkR on Mac and Linux.

--Hossein

On Thu, Sep 24, 2015 at 3:10 PM, Xiangrui Meng <men...@gmail.com> wrote:

> +1. Checked user guide and API doc, and ran some MLlib and SparkR
> examples. -Xiangrui
>
> On Thu, Sep 24, 2015 at 2:54 PM, Reynold Xin <r...@databricks.com> wrote:
> > I'm going to +1 this myself. Tested on my laptop.
> >
> >
> >
> > On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin <r...@databricks.com>
> wrote:
> >>
> >> I forked a new thread for this. Please discuss NOTICE file related
> things
> >> there so it doesn't hijack this thread.
> >>
> >>
> >> On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen <so...@cloudera.com> wrote:
> >>>
> >>> On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas <rhil...@us.ibm.com>
> >>> wrote:
> >>> > Under your guidance, I would be happy to help compile a NOTICE file
> >>> > which
> >>> > follows the pattern used by Derby and the JDK. This effort might
> >>> > proceed in
> >>> > parallel with vetting 1.5.1 and could be targeted at a later release
> >>> > vehicle. I don't think that the ASF's exposure is greatly increased
> by
> >>> > one
> >>> > more release which follows the old pattern.
> >>>
> >>> I'd prefer to use the ASF's preferred pattern, no? That's what we've
> >>> been trying to do and seems like we're even required to do so, not
> >>> follow a different convention. There is some specific guidance there
> >>> about what to add, and not add, to these files. Specifically, because
> >>> the AL2 requires downstream projects to embed the contents of NOTICE,
> >>> the guidance is to only include elements in NOTICE that must appear
> >>> there.
> >>>
> >>> Put it this way -- what would you like to change specifically? (you
> >>> can start another thread for that)
> >>>
> >>> >> My assessment (just looked before I saw Sean's email) is the same as
> >>> >> his. The NOTICE file embeds other projects' licenses.
> >>> >
> >>> > This may be where our perspectives diverge. I did not find those
> >>> > licenses
> >>> > embedded in the NOTICE file. As I see it, the licenses are cited but
> >>> > not
> >>> > included.
> >>>
> >>> Pretty sure that was meant to say that NOTICE embeds other projects'
> >>> "notices", not licenses. And those notices can have all kinds of
> >>> stuff, including licenses.
> >>>
> >>>
> >>
> >
>
>
>


Re: SparkR package path

2015-09-23 Thread Hossein
Yes, I think exposing SparkR on CRAN can significantly expand the reach of
both SparkR and Spark itself to a larger community of data scientists (and
statisticians).

I have been getting questions on how to use SparkR in RStudio. Most of
these folks have a Spark cluster and wish to talk to it from RStudio. While
that is a bigger task, for now a first step could be not requiring them to
download the Spark source and run a script named install-dev.sh. I filed
SPARK-10776 to track this.


--Hossein

On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> As Rui says it would be good to understand the use case we want to
> support (supporting CRAN installs could be one for example). I don't
> think it should be very hard to do as the RBackend itself doesn't use
> the R source files. The RRDD does use it and the value comes from
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
> AFAIK -- So we could introduce a new config flag that can be used for
> this new mode.
>
> Thanks
> Shivaram
>
> On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui <rui@intel.com> wrote:
> > Hossein,
> >
> >
> >
> > Any strong reason to download and install SparkR source package
> separately
> > from the Spark distribution?
> >
> > An R user can simply download the spark distribution, which contains
> SparkR
> > source and binary package, and directly use sparkR. No need to install
> > SparkR package at all.
> >
> >
> >
> > From: Hossein [mailto:fal...@gmail.com]
> > Sent: Tuesday, September 22, 2015 9:19 AM
> > To: dev@spark.apache.org
> > Subject: SparkR package path
> >
> >
> >
> > Hi dev list,
> >
> >
> >
> > The SparkR backend assumes SparkR source files are located under
> > "SPARK_HOME/R/lib/." This directory is created by running
> > R/install-dev.sh. This setting makes sense for Spark developers, but if an
> > R user downloads and installs the SparkR source package, the source files
> > are going to be placed in different locations.
> >
> >
> >
> > In the R runtime it is easy to find the location of package files using
> > path.package("SparkR"). But we need to make some changes to the R backend
> > and/or spark-submit so that the JVM process learns the location of worker.R,
> > daemon.R, and shell.R from the R runtime.
> >
> >
> >
> > Do you think this change is feasible?
> >
> >
> >
> > Thanks,
> >
> > --Hossein
>


SparkR package path

2015-09-21 Thread Hossein
Hi dev list,

The SparkR backend assumes SparkR source files are located under
"SPARK_HOME/R/lib/." This directory is created by running R/install-dev.sh.
This setting makes sense for Spark developers, but if an R user downloads
and installs the SparkR source package, the source files are going to be
placed in different locations.

In the R runtime it is easy to find the location of package files using
path.package("SparkR"). But we need to make some changes to the R backend
and/or spark-submit so that the JVM process learns the location of worker.R,
daemon.R, and shell.R from the R runtime.
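
For example, here is a rough sketch of the idea (not an existing mechanism;
the configuration property name below is made up for illustration): the R
side could resolve the installed package location and pass it to the JVM as
a Spark configuration value.

    # Sketch only: resolve the installed SparkR package location from R and
    # hand it to the backend, instead of assuming SPARK_HOME/R/lib.
    library(SparkR)
    sparkrHome <- path.package("SparkR")     # e.g. ~/R/library/SparkR
    # "spark.r.package.home" is an illustrative property name; the backend /
    # spark-submit would need a corresponding change to read it.
    sc <- sparkR.init(master = "local[*]",
                      sparkEnvir = list("spark.r.package.home" = sparkrHome))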

Do you think this change is feasible?

Thanks,
--Hossein