[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477150#comment-15477150 ] Yanbo Liang commented on SPARK-17428:

Yeah, I agree to start with something simple and iterate later. I will do some experiments to verify whether it works well for my use case. Thanks for all your help! [~shivaram] [~sunrui] [~felixcheung] [~zjffdu]

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
> Issue Type: New Feature
> Components: SparkR
> Reporter: Yanbo Liang
>
> Many users need to use third-party R packages in executors/workers, but SparkR cannot satisfy this requirement elegantly. For example, you have to ask the IT/administrators of the cluster to deploy these R packages on each executor/worker node, which is very inflexible.
> I think we should support third-party R packages for SparkR users, as we do for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository on each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first step is to make SparkR executors support a virtualenv-like mechanism, similar to Python conda. I have investigated and found that packrat (http://rstudio.github.io/packrat/) is one of the candidates to support virtualenv for R. Packrat is a dependency management system for R and can isolate the dependent R packages in its own private package space. SparkR users could then install third-party packages in the application scope (destroyed after the application exits) and would not need to bother IT/administrators to install these packages manually.
> I would like to know whether this makes sense.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475943#comment-15475943 ] Shivaram Venkataraman commented on SPARK-17428:

I think there are a bunch of issues being discussed here. My initial take would be to add support for something simple and then iterate based on user feedback. Given that R users generally don't know / care much about package version numbers, I'd say an initial cut handles two flags in spark-submit: (a) a list of package names, on which `install.packages` is called on each machine, and (b) a list of package tar.gz files that are installed with `R CMD INSTALL` on each machine.

We can also make the package installs lazy, i.e. they only get run on a worker when an R worker process is launched there. Will this meet the user needs you have in mind, [~yanboliang]?
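The lazy-install idea above can be sketched as a small guard that each R worker runs before doing work: packages are installed into a per-application private library exactly once per worker, recorded by a marker file. This is a hypothetical sketch, not existing Spark behavior; the directory layout and the `install_pkgs` stand-in are assumptions.

```shell
#!/bin/sh
# Hypothetical lazy-install guard for an R worker (sketch, not Spark code).
# Packages go into a per-application private library exactly once per worker;
# a marker file records that the install already ran.

APP_LIB="${APP_LIB:-/tmp/sparkr-app-lib}"   # per-application private library (example path)

# Stand-in for the real install step, e.g. calling install.packages via Rscript.
install_pkgs() {
    echo "installing: $*"
}

lazy_install() {
    marker="$APP_LIB/.packages-installed"
    if [ -f "$marker" ]; then
        echo "already installed, skipping"
        return 0
    fi
    mkdir -p "$APP_LIB"
    install_pkgs "$@" && touch "$marker"
}
```

The first call on a worker performs the install; every later call sees the marker and returns immediately, so a job with many tasks per executor pays the install cost once.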
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475664#comment-15475664 ] Jeff Zhang commented on SPARK-17428:

Found another elegant way to specify a version, using devtools: https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages

{code}
require(devtools)
install_version("ggplot2", version = "0.9.1", repos = "http://cran.us.r-project.org")
{code}
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475645#comment-15475645 ] Jeff Zhang commented on SPARK-17428:

I just linked the jira of python virtualenv. It seems R supports virtualenv-like isolation natively: install.packages can specify the installation destination folder, and it is isolated across users. I think there are two scenarios for the SparkR environment: one where the cluster has internet access, and another where it does not.

If the cluster has internet access, then I think we can call install.packages directly:

{code}
install.packages("dplyr", lib = "<private lib dir>")
library(dplyr, lib.loc = "<private lib dir>")
{code}

If the cluster doesn't have internet access, then the driver can first download the package tarballs and add them through --files, and the executors will try to compile and install them:

{code}
install.packages("<package tarball>", repos = NULL, type = "source", lib = "<private lib dir>")
library(dplyr, lib.loc = "<private lib dir>")
{code}

For this scenario, if the package has dependencies, it would still try to download them from the internet; otherwise the user has to manually figure out the dependencies and add them to the Spark app.
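The offline path above can be sketched as a dry run: tarballs shipped via `--files` arrive in the executor's working directory, and a helper emits the `R CMD INSTALL` command for each one, targeting a private library. The helper only prints the commands here; the paths and package names are illustrative assumptions.

```shell
#!/bin/sh
# Dry-run sketch: install locally shipped R source tarballs into a private
# library. Commands are printed, not executed; paths are examples only.

PRIVATE_LIB="${PRIVATE_LIB:-./rlibs}"

install_shipped_tarballs() {
    # $@ = tarballs that arrived on the executor via spark-submit --files
    for tgz in "$@"; do
        # R CMD INSTALL -l <lib> <pkg.tar.gz> installs a source package into
        # <lib>; no root is needed when <lib> is user-writable.
        echo "R CMD INSTALL -l $PRIVATE_LIB $tgz"
    done
}
```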
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475630#comment-15475630 ] Jeff Zhang commented on SPARK-17428:

To install a specific version with install.packages, the source package URL needs to be specified: http://stackoverflow.com/questions/17082341/installing-older-version-of-r-package
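The source-URL approach from the linked answer can be mechanized: CRAN keeps superseded releases under a predictable `src/contrib/Archive/<pkg>/` path, so a package name and version map directly to a tarball URL, which can then be passed to `install.packages(<url>, repos = NULL, type = "source")`. A sketch of that mapping (the URL pattern is the conventional CRAN archive layout):

```shell
#!/bin/sh
# Sketch: build the CRAN archive URL for a pinned package version.
# CRAN stores old releases under src/contrib/Archive/<pkg>/<pkg>_<ver>.tar.gz.

cran_archive_url() {
    pkg="$1"; ver="$2"
    echo "https://cran.r-project.org/src/contrib/Archive/${pkg}/${pkg}_${ver}.tar.gz"
}
```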
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475612#comment-15475612 ] Felix Cheung commented on SPARK-17428:

I don't think there is a way to specify a version number for install.packages in R.

Python does compile code - installing packages with pip byte-compiles the Python scripts into .pyc files. https://www.google.com/search?q=pyc And many packages have heavy native components that will not work without installing as root (or heavy hacking), e.g. matplotlib, scipy.
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475593#comment-15475593 ] Sun Rui commented on SPARK-17428:

I don't understand the meaning of exact version control. I think a user can specify downloaded R packages, or specify a package name and version and let SparkR download it from CRAN.

PySpark does not have the compilation issue, as Python code needs no compilation; the Python interpreter abstracts the underlying architecture differences just as the JVM does. For the R package compilation issue, maybe we can have the following policies:
1. For binary R packages, just deliver them to worker nodes.
2. For source R packages:
2.1 If only R code is contained, compilation on the driver node is OK.
2.2 If C/C++ code is contained, by default compile it on the driver node, but provide an option --compile-on-workers allowing users to choose to compile on worker nodes. If the option is specified, users should ensure the compilation tool chain is ready on worker nodes.
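The policy above amounts to a small dispatch on the package artifact: binary packages ship as-is, pure-R source packages can be built anywhere, and source packages with native code need a compiler somewhere. A sketch of that decision follows; the file-name conventions are the usual R ones (.zip Windows binary, .tgz macOS binary, .tar.gz source), the `src/` check is a common heuristic for native code, and the function only returns a label here.

```shell
#!/bin/sh
# Sketch: classify an R package artifact per the install policy above.
# A source tarball containing a src/ directory has C/C++/Fortran code.

classify_pkg() {
    pkg="$1"
    case "$pkg" in
        *.zip|*.tgz)
            echo "binary: ship as-is to workers" ;;
        *.tar.gz)
            if tar -tzf "$pkg" 2>/dev/null | grep -q '/src/'; then
                echo "source with native code: compile on driver (or --compile-on-workers)"
            else
                echo "pure R source: compile on driver"
            fi ;;
        *)
            echo "unknown" ;;
    esac
}
```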
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474587#comment-15474587 ] Felix Cheung commented on SPARK-17428:

Agree with the above. And to be clear, packrat still calls install.packages, so it won't differ in how the package directory (the lib parameter to install.packages) or permission/access is handled: https://github.com/rstudio/packrat/blob/master/R/install.R#L69

We are likely going to prefer having private packages under the application directory in the case of YARN, so they get cleaned up along with the application. It seems like the original point of this JIRA is around private packages and installation/deployment - I think we would agree we could handle that (or SparkR on YARN already can do that).

My point, though, is that the benefit of such a package management system is really the exact version control one gets. But even then, building packages from source on worker machines could be problematic (this applies both to packrat and to calls to install.packages): https://rstudio.github.io/packrat/limitations.html - I'm not sure we should assume all worker machines in enterprises have a C compiler, or that the user running Spark has permission to build source code.

I don't know where we are at with PySpark, but I'd be very interested in seeing how that is resolved - I think both Python and R face similar constraints in terms of deployment/package building, versioning, heterogeneous machine architecture, and so on.
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474193#comment-15474193 ] Shivaram Venkataraman commented on SPARK-17428:

I agree with [~sunrui] - just to make it more concrete, something like

{code}
install.packages("dplyr", lib = "/tmp")
library(dplyr, lib.loc = "/tmp")
{code}

creates `/tmp/dplyr` and puts the package there (no root required for this). We can also automatically search `/tmp` for packages by adding it to `.libPaths()`. Note that /tmp is just an example here and we can replace it with the YARN local dir etc.
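One way to get the automatic `.libPaths()` search without changing user scripts is to export the per-application directory in the worker's environment: R prepends the directories in `R_LIBS_USER` to the library search path at startup. A dry-run sketch (the `worker.R` name and path are hypothetical; the command is printed, not executed):

```shell
#!/bin/sh
# Sketch: launch an R worker so a per-app library is on its .libPaths().
# R reads R_LIBS_USER at startup and puts it ahead of the system library.

launch_worker() {
    app_lib="$1"
    echo "R_LIBS_USER=$app_lib Rscript worker.R"   # dry run: print, don't exec
}
```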
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474162#comment-15474162 ] Sun Rui commented on SPARK-17428:

For your point 1: if we specify a normal temporary directory for installing on executor nodes, it seems no root privilege is required.
For your point 2: if we specify a normal temporary directory for installing on executor nodes, there is no pollution of the executors' R libraries.
For your point 3: this is a concern, typically for client deployment mode, where the driver may be outside the cluster and may have a different architecture from the nodes of the cluster. This needs more discussion.
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473643#comment-15473643 ] Yanbo Liang commented on SPARK-17428:

[~sunrui] [~shivaram] [~felixcheung] Thanks for your reply. Yes, we can compile packages at the driver and send them to executors, but that involves some issues:
* Usually the Spark job is not run as root, but we may need root privilege to install R packages on executors, which is not permitted.
* After we run a SparkR job, the executors' R libraries will be polluted, and when another job runs on that executor it may fail due to some conflict.
* The architectures of the driver and executors may differ, so packages compiled on the driver may not work when sent to executors if they depend on architecture-specific code.
SparkR currently cannot solve these issues. I investigated and found that packrat can help us in this direction, but more experiments may be needed. If this proposal makes sense, I can work on this feature. Please feel free to let me know what concerns you. Thanks!
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472995#comment-15472995 ] Felix Cheung commented on SPARK-17428:

PySpark in fact has an ongoing PR on supporting `virtualenv` and `wheel`, but I don't think that is fully resolved yet for Python. I think it is an interesting use case. The advantage of a package management tool is the ability to control the exact version of packages - install.packages would just pick the latest, which could cause inconsistencies between different nodes in the cluster.

I also think we need to think deeper on this - I have often run into issues with Python or R packages that require native dependencies and compilation, and that often only install as root. I'm not sure we want Spark jobs to run as root.
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471040#comment-15471040 ] Shivaram Venkataraman commented on SPARK-17428:

Yeah, so it should be relatively easy to install any R package from CRAN / a set of repos to a specified directory. The `lib` option at https://stat.ethz.ch/R-manual/R-devel/library/utils/html/install.packages.html can be used for this. So one way to do this would be to take in the names of R packages and/or tar.gz files and invoke `install.packages` with the appropriate YARN local dir or Mesos local dir passed in as `lib`.

I think [~sunrui] has a good point about compiling packages on one machine vs. many machines. Compiling only on the driver will save some work. Just as a point of reference, how do we handle source packages in PySpark?
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469806#comment-15469806 ] Sun Rui commented on SPARK-17428: - [~yanboliang] Allowing users to pass dependent R packages to executors is a convenient feature. However, there may be no need for a third-party R package for isolation, because the underlying cluster managers may have built-in support for it -- for example, YARN Local Resources and the Mesos Sandbox. Actually, SparkR on YARN already supports passing dependent R packages to executors. The remaining question is which of these is better (SparkR on YARN uses option 1 for now): 1. Compile R packages from source on the driver node and pass the binary packages to the executors; 2. Compile R packages from source on all executor nodes.
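The two options above can be sketched as shell commands. These are illustrative only: they assume R is installed, `mypkg_1.0.tar.gz` is a hypothetical source package, and `app.R` a hypothetical SparkR script; the binary tarball filename produced by `--build` varies by platform.

```shell
# Option 1: compile once on the driver, ship the binary to executors.
# `R CMD INSTALL --build` produces a platform-specific binary tarball
# that executors can unpack without needing a compiler toolchain.
R CMD INSTALL --build mypkg_1.0.tar.gz
spark-submit --master yarn --files mypkg_1.0_R_x86_64-pc-linux-gnu.tar.gz app.R

# Option 2: ship the source package and compile on every executor node.
# This requires build tools on each node but avoids any binary
# compatibility concerns between driver and executors.
spark-submit --master yarn --files mypkg_1.0.tar.gz app.R
```

Option 1 does the compilation work once, at the cost of assuming the driver and executors share an architecture and R version; option 2 makes no such assumption but repeats the build on every node.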
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469736#comment-15469736 ] Yanbo Liang commented on SPARK-17428: - cc [~shivaram] [~felixcheung]