[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-09 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477150#comment-15477150
 ] 

Yanbo Liang commented on SPARK-17428:
-

Yeah, I agree to start with something simple and iterate later. I will do some 
experiments to verify whether it works well for my use case. Thanks for all 
your help! [~shivaram] [~sunrui] [~felixcheung] [~zjffdu]

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users need to use third-party R packages in executors/workers, but 
> SparkR cannot satisfy this requirement elegantly. For example, you have to 
> ask the IT/administrators of the cluster to deploy these R packages on each 
> executor/worker node, which is very inflexible.
> I think we should support third-party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository 
> on each executor.
> 2. Users can upload their local R packages and install them on each executor.
> To achieve this goal, the first step is to make SparkR executors support a 
> virtualenv-like mechanism, similar to Python's conda. I have investigated and 
> found that packrat (http://rstudio.github.io/packrat/) is one candidate for 
> supporting virtualenv for R. Packrat is a dependency management system for R 
> that can isolate the dependent R packages in its own private package space. 
> SparkR users could then install third-party packages at application scope 
> (destroyed after the application exits) and would not need to ask 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.
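A minimal packrat sketch of the per-application isolation idea described above 
(illustrative only; the project path is a hypothetical placeholder, not part of 
any proposal):
{code}
# initialize a private, per-application package library under the app directory
packrat::init("/path/to/spark_app_dir")
# with packrat mode on, installs go into the private library, not the system one
install.packages("dplyr")
packrat::snapshot()   # record the exact package versions for reproducibility
# on another node, the same environment can be rebuilt from the snapshot
packrat::restore()
{code}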






[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475943#comment-15475943
 ] 

Shivaram Venkataraman commented on SPARK-17428:
---

I think there are a bunch of issues being discussed here. My initial take would 
be to add support for something simple and then iterate based on user feedback. 
Given that R users generally don't know or care much about package version 
numbers, I'd suggest an initial cut that handles two flags in spark-submit:

(a) a list of package names, for which we call `install.packages` on each 
machine
(b) a list of package tar.gz files that are installed with `R CMD INSTALL` on 
each machine

We can also make the package installs lazy, i.e. they only get run on a worker 
when an R worker process is launched there (see the sketch below). Will this 
meet the user needs you have in mind, [~yanboliang]?
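A rough sketch of what such a lazy, per-node install helper could look like 
(the spark-submit flag plumbing is omitted; the function name, repo URL, and 
arguments here are hypothetical, not part of any proposal):
{code}
# Install CRAN packages and local tarballs into a per-application library the
# first time an R worker starts on a node.
ensure_r_packages <- function(cran_pkgs, tarballs, lib_dir) {
  dir.create(lib_dir, recursive = TRUE, showWarnings = FALSE)
  .libPaths(c(lib_dir, .libPaths()))            # make the private library visible
  missing <- setdiff(cran_pkgs,
                     rownames(installed.packages(lib.loc = lib_dir)))
  if (length(missing) > 0)                      # (a) package names from CRAN
    install.packages(missing, lib = lib_dir, repos = "https://cloud.r-project.org")
  for (tb in tarballs)                          # (b) tar.gz files shipped with the job
    install.packages(tb, lib = lib_dir, repos = NULL, type = "source")
}
{code}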




[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475664#comment-15475664
 ] 

Jeff Zhang commented on SPARK-17428:


Found another, more elegant way to specify a version, using devtools:
https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages
{code}
require(devtools)
install_version("ggplot2", version = "0.9.1", repos = "http://cran.us.r-project.org")
{code}




[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475645#comment-15475645
 ] 

Jeff Zhang commented on SPARK-17428:


I just linked the JIRA for Python virtualenv. It seems R supports 
virtualenv-like isolation natively: install.packages can specify the version 
and the installation destination folder, and installs are isolated across 
users. I think there are two scenarios for the SparkR environment: one where 
the cluster has internet access, and one where it does not.
If the cluster has internet access, then I think we can call install.packages 
directly:
{code}
install.packages("dplyr", lib = "<lib_dir>")
library(dplyr, lib.loc = "<lib_dir>")
{code}
If the cluster doesn't have internet access, then the driver can first download 
the package tarballs and add them through --files, and the executor will try to 
compile and install these packages:
{code}
install.packages("<pkg_tarball>", repos = NULL, type = "source",
                 lib = "<lib_dir>")
library(dplyr, lib.loc = "<lib_dir>")
{code}
In this scenario, if a package has dependencies, install.packages would still 
try to download them from the internet, or the user has to manually figure out 
the dependencies and add them to the Spark app (see the sketch below).
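Just to illustrate that last point (a sketch of one possible offline workflow, 
not part of the proposal; the repo URL and destination directory are 
assumptions): the dependency closure can be computed up front on the driver 
with tools::package_dependencies and the tarballs downloaded before being 
passed via --files.
{code}
library(tools)
repo <- "https://cloud.r-project.org"
db <- available.packages(repos = repo)
# recursive dependency closure of the package we want to ship
deps <- package_dependencies("dplyr", db = db, recursive = TRUE)[["dplyr"]]
# download source tarballs for the package and its dependencies,
# then pass the files to spark-submit via --files
dir.create("pkg_cache", showWarnings = FALSE)
download.packages(c("dplyr", deps), destdir = "pkg_cache",
                  type = "source", repos = repo)
{code}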





[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475630#comment-15475630
 ] 

Jeff Zhang commented on SPARK-17428:


The source package URL needs to be specified to install a specific version: 
http://stackoverflow.com/questions/17082341/installing-older-version-of-r-package





[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475612#comment-15475612
 ] 

Felix Cheung commented on SPARK-17428:
--

I don't think there is a way to specify a version number for install.packages 
in R, is there?

Python does compile code - installing packages with pip compiles the Python 
scripts to .pyc files. https://www.google.com/search?q=pyc
Also, many packages have heavy native components that will not work without 
installing as root (or heavy hacking), e.g. matplotlib and scipy.





[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475593#comment-15475593
 ] 

Sun Rui commented on SPARK-17428:
-

I don't understand the meaning of exact version control. I think a user can 
specify downloaded R packages, or specify a package name and version and let 
SparkR download it from CRAN.

PySpark does not have the compilation issue, as Python code needs no 
compilation; the Python interpreter abstracts the underlying architecture 
differences just as the JVM does.

For the R package compilation issue, maybe we can have the following policies 
(a sketch for telling the two source-package cases apart follows below):
1. For binary R packages, just deliver them to the worker nodes;
2. For source R packages:
  2.1 if only R code is contained, building on the driver node is OK
  2.2 if C/C++ code is contained, by default compile it on the driver node, but 
we can have an option --compile-on-workers allowing users to choose to compile 
on worker nodes. If the option is specified, users should ensure the 
compilation toolchain is ready on the worker nodes.
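One rough way to tell case 2.1 from 2.2 (illustrative only; it assumes the 
tarball was built by R CMD build / CRAN so its DESCRIPTION carries a 
NeedsCompilation field):
{code}
# Peek at the DESCRIPTION inside a source tarball and check NeedsCompilation.
needs_compilation <- function(tarball) {
  files <- untar(tarball, list = TRUE)
  desc <- grep("^[^/]+/DESCRIPTION$", files, value = TRUE)[1]
  tmp <- tempfile(); dir.create(tmp)
  untar(tarball, files = desc, exdir = tmp)
  dcf <- read.dcf(file.path(tmp, desc))
  # the field may be absent for packages not built by CRAN; treat that as "no"
  "NeedsCompilation" %in% colnames(dcf) &&
    identical(toupper(dcf[1, "NeedsCompilation"]), "YES")
}
{code}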




[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474587#comment-15474587
 ] 

Felix Cheung commented on SPARK-17428:
--

Agree with the above. And to be clear, packrat still calls install.packages, so 
it is no different in how the package directory (the lib parameter to 
install.packages) or permission/access is handled:
https://github.com/rstudio/packrat/blob/master/R/install.R#L69

We are likely going to prefer having private packages under the application 
directory in the case of YARN, so they get cleaned up along with the 
application.

It seems like the original point of this JIRA is around private packages and 
installation/deployment - I think we would agree we could handle that (or 
SparkR on YARN already can).

My point, though, is that the benefit of such a package management system is 
really the exact version control one gets.

But even then, building packages from source on worker machines could be 
problematic (this applies both to packrat and to plain calls to 
install.packages):
https://rstudio.github.io/packrat/limitations.html
- I'm not sure we should assume all worker machines in enterprises have a C 
compiler, or that the user running Spark has permission to build source code.

I don't know where we are with PySpark, but I'd be very interested in seeing 
how that is resolved - I think both Python and R face similar constraints in 
terms of deployment/package building, versioning, heterogeneous machine 
architectures, and so on.




[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474193#comment-15474193
 ] 

Shivaram Venkataraman commented on SPARK-17428:
---

I agree with [~sunrui] - Just to make it more concrete, something like
{code}
install.packages("dplyr", lib="/tmp/")
library(dplyr, lib.loc="/tmp")
{code}

creates `/tmp/dplyr` and puts the package there (no root required for this). We 
can also automatically search `/tmp` for packages by adding it to `.libPaths()` 
as well, as sketched below. Note that /tmp is just an example here and we can 
replace it with the YARN local dir etc.
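For completeness, the `.libPaths()` variant mentioned above would look roughly 
like this (illustrative only):
{code}
.libPaths(c("/tmp", .libPaths()))  # prepend the private library to the search path
library(dplyr)                     # now resolves from /tmp/dplyr without lib.loc
{code}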




[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474162#comment-15474162
 ] 

Sun Rui commented on SPARK-17428:
-

For your point 1: if we specify a normal temporary directory for installation 
on executor nodes, it seems no root privilege is required.

For your point 2: if we specify a normal temporary directory for installation 
on executor nodes, there is no pollution of the executors' R libraries.

For your point 3: this is a concern, typically for client deployment mode, 
where the driver may be outside the cluster and may have a different 
architecture from the cluster nodes. This needs more discussion.




[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473643#comment-15473643
 ] 

Yanbo Liang commented on SPARK-17428:
-

[~sunrui] [~shivaram] [~felixcheung] Thanks for your replies.
Yes, we can compile packages on the driver and send them to executors, but that 
involves some issues:
* Usually the Spark job is not run as root, but we would need root privileges 
to install R packages on executors, which is not permitted.
* After we run a SparkR job, the executors' R libraries will be polluted, and 
when another job runs on that executor it may fail due to some conflict.
* The architectures of the driver and executors may differ, so packages 
compiled on the driver may not work when sent to executors if they depend on 
architecture-specific code.

These issues cannot be solved by SparkR currently. I investigated and found 
that packrat can help us in this direction, but more experiments may be needed. 
If this proposal makes sense, I can work on this feature. Please feel free to 
let me know your concerns. Thanks!




[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472995#comment-15472995
 ] 

Felix Cheung commented on SPARK-17428:
--

PySpark in fact has an ongoing PR on supporting `virtualenv` and `wheel`, but I 
don't think that is fully resolved yet for Python.

I think it is an interesting use case. The advantage of a package management 
tool is the ability to control the exact versions of packages - 
install.packages would just pick the latest, which could cause inconsistencies 
between different nodes in the cluster.

I also think we need to think deeper on this - I have often run into issues 
with Python or R packages that require native dependencies and compilation, and 
often only install as root. I'm not sure we want Spark jobs to run as root.




[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-07 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471040#comment-15471040
 ] 

Shivaram Venkataraman commented on SPARK-17428:
---

Yeah, it should be relatively easy to install any R package from CRAN / a set 
of repos to a specified directory. The `lib` option at 
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/install.packages.html 
can be used for this.

So one way to do this would be to take in the names of R packages and/or tar.gz 
files and invoke `install.packages` with the appropriate YARN local dir or 
Mesos local dir passed in as `lib` (see the sketch below).

I think [~sunrui] has a good point about compiling packages on one machine vs. 
many machines. I think compiling only on the driver will save some work -- just 
as a point of reference, how do we handle source packages in PySpark?
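A rough illustration of passing a cluster-manager local dir as `lib` 
(illustrative only; on YARN a container's local dirs are exposed through 
environment variables such as LOCAL_DIRS, and the tarball path below is just a 
placeholder):
{code}
local_dir <- strsplit(Sys.getenv("LOCAL_DIRS", unset = tempdir()), ",")[[1]][1]
pkg_lib <- file.path(local_dir, "sparkr-packages")
dir.create(pkg_lib, recursive = TRUE, showWarnings = FALSE)
install.packages("data.table", lib = pkg_lib)        # CRAN package by name
install.packages("<mypkg>.tar.gz", lib = pkg_lib,    # local tar.gz file
                 repos = NULL, type = "source")
library(data.table, lib.loc = pkg_lib)
{code}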




[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-07 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469806#comment-15469806
 ] 

Sun Rui commented on SPARK-17428:
-

[~yanboliang] Allowing users to pass dependent R packages to executors is a 
convenient feature. However, maybe there is no need for a third-party R package 
for isolation, because the underlying cluster managers may have built-in 
support for it - for example, YARN local resources and the Mesos sandbox. 
Actually, SparkR on YARN already supports passing dependent R packages to 
executors. The remaining question is which one is better (SparkR on YARN uses 
option 1 for now; a sketch of option 1 follows below):
1. Compile R packages from source on the driver node and pass the binary 
packages to executors;
2. Compile R packages from source on all executor nodes.
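A sketch of what option 1 could look like from R on the driver (illustrative 
only; the source tarball name is a placeholder, and the exact name of the 
produced binary tarball depends on the R version and platform):
{code}
# install the source package on the driver and also build a binary package;
# --build makes R CMD INSTALL emit e.g. mypkg_0.1_R_x86_64-pc-linux-gnu.tar.gz
install.packages("mypkg_0.1.tar.gz", repos = NULL, type = "source",
                 INSTALL_opts = "--build")
# the resulting binary tarball can then be shipped to executors and installed
# there without a compilation toolchain
{code}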





[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469736#comment-15469736
 ] 

Yanbo Liang commented on SPARK-17428:
-

cc [~shivaram] [~felixcheung]
