[GitHub] spark pull request #15888: [SPARK-18444][SPARKR] SparkR running in yarn-clus...

2016-11-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15888





[GitHub] spark pull request #15888: [SPARK-18444][SPARKR] SparkR running in yarn-clus...

2016-11-21 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15888#discussion_r8285
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkR.R ---
@@ -0,0 +1,36 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+context("functions in sparkR.R")
+
+test_that("sparkCheckInstall", {
+  # "local, yarn-client, mesos-client" mode, SPARK_HOME was set correctly,
+  # and the SparkR job was submitted by "spark-submit"
+  sparkHome <- paste0(tempdir(), "/", "sparkHome")
+  dir.create(sparkHome)
+  master <- ""
+  deployMode <- ""
+  expect_true(is.null(sparkCheckInstall(sparkHome, master, deployMode)))
+  unlink(sparkHome, recursive = TRUE)
+
+  # "yarn-cluster, mesos-cluster" mode, SPARK_HOME was not set,
+  # and the SparkR job was submitted by "spark-submit"
+  sparkHome <- ""
+  master <- ""
+  deployMode <- ""
+  expect_true(is.null(sparkCheckInstall(sparkHome, master, deployMode)))
--- End diff --

Added.





[GitHub] spark pull request #15888: [SPARK-18444][SPARKR] SparkR running in yarn-clus...

2016-11-20 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15888#discussion_r88808464
  
--- Diff: R/pkg/R/sparkR.R ---
@@ -373,8 +373,13 @@ sparkR.session <- function(
     overrideEnvs(sparkConfigMap, paramMap)
   }
 
+  deployMode <- ""
+  if (exists("spark.submit.deployMode", envir = sparkConfigMap)) {
+    deployMode <- sparkConfigMap[["spark.submit.deployMode"]]
+  }
--- End diff --

fair enough





[GitHub] spark pull request #15888: [SPARK-18444][SPARKR] SparkR running in yarn-clus...

2016-11-20 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15888#discussion_r88808475
  
--- Diff: R/pkg/R/sparkR.R ---
@@ -550,24 +555,27 @@ processSparkPackages <- function(packages) {
 #
 # @param sparkHome directory to find Spark package.
 # @param master the Spark master URL, used to check local or remote mode.
+# @param deployMode whether to deploy your driver on the worker nodes (cluster)
+#                   or locally as an external client (client).
 # @return NULL if no need to update sparkHome, and new sparkHome otherwise.
-sparkCheckInstall <- function(sparkHome, master) {
+sparkCheckInstall <- function(sparkHome, master, deployMode) {
   if (!isSparkRShell()) {
     if (!is.na(file.info(sparkHome)$isdir)) {
       msg <- paste0("Spark package found in SPARK_HOME: ", sparkHome)
       message(msg)
       NULL
     } else {
-      if (!nzchar(master) || isMasterLocal(master)) {
-        msg <- paste0("Spark not found in SPARK_HOME: ",
-                      sparkHome)
+      if (isMasterLocal(master)) {
+        msg <- paste0("Spark not found in SPARK_HOME: ", sparkHome)
         message(msg)
         packageLocalDir <- install.spark()
         packageLocalDir
-      } else {
+      } else if (isClientMode(master) || deployMode == "client") {
--- End diff --

ah yes..





[GitHub] spark pull request #15888: [SPARK-18444][SPARKR] SparkR running in yarn-clus...

2016-11-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15888#discussion_r88763342
  
--- Diff: R/pkg/R/sparkR.R ---
@@ -373,8 +373,13 @@ sparkR.session <- function(
     overrideEnvs(sparkConfigMap, paramMap)
   }
 
+  deployMode <- ""
+  if (exists("spark.submit.deployMode", envir = sparkConfigMap)) {
+    deployMode <- sparkConfigMap[["spark.submit.deployMode"]]
+  }
--- End diff --

You could instead just do `deployMode <- sparkConfigMap[["spark.submit.deployMode"]]`; it will be set to NULL if the key is not set.






[GitHub] spark pull request #15888: [SPARK-18444][SPARKR] SparkR running in yarn-clus...

2016-11-16 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15888#discussion_r88214305
  
--- Diff: R/pkg/R/sparkR.R ---
@@ -550,24 +566,28 @@ processSparkPackages <- function(packages) {
 #
 # @param sparkHome directory to find Spark package.
 # @param master the Spark master URL, used to check local or remote mode.
+# @param deployMode whether to deploy your driver on the worker nodes (cluster) or
+#                   locally as an external client (client).
 # @return NULL if no need to update sparkHome, and new sparkHome otherwise.
-sparkCheckInstall <- function(sparkHome, master) {
+sparkCheckInstall <- function(sparkHome, master, deployMode) {
   if (!isSparkRShell()) {
     if (!is.na(file.info(sparkHome)$isdir)) {
       msg <- paste0("Spark package found in SPARK_HOME: ", sparkHome)
       message(msg)
       NULL
     } else {
-      if (!nzchar(master) || isMasterLocal(master)) {
+      if (isMasterLocal(master)) {
         msg <- paste0("Spark not found in SPARK_HOME: ",
                       sparkHome)
         message(msg)
         packageLocalDir <- install.spark()
         packageLocalDir
-      } else {
+      } else if (nzchar(deployMode) && deployMode == "client") {
         msg <- paste0("Spark not found in SPARK_HOME: ",
                       sparkHome, "\n", installInstruction("remote"))
         stop(msg)
+      } else {
+        NULL
--- End diff --

@felixcheung
If we submit SparkR jobs by ```spark-submit```, then ```master``` and ```deployMode``` will be empty. ```sparkHome``` will be set correctly in ```client``` and ```local``` mode, but will be empty in ```cluster``` mode. ```sparkCheckInstall``` returns NULL in all of these scenarios.
If we start SparkR via ```sparkR.session``` and pass in the corresponding arguments, we should download the Spark package when ```master = local[*]```. If ```deployMode``` was set to ```client``` without ```SPARK_HOME``` set correctly, we should stop due to misconfiguration. Otherwise (deploying in ```yarn``` or ```mesos``` cluster mode) we return NULL here.
Since Spark 2.0, ```master = yarn-cluster``` has been deprecated in favor of ```master = yarn, deployMode = cluster```. Since ```sparkR.session``` was introduced in 2.0, I don't think we should support the old convention here.
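
For illustration, here is a minimal testthat-style sketch of the behavior described above. This is a sketch only, not the actual test file: it assumes ```sparkCheckInstall``` as defined in the diff above (it is an internal function, so it would need ```SparkR:::sparkCheckInstall``` outside the package) and a plain R session rather than the SparkR shell.

```
library(testthat)

# spark-submit in local/client mode: SPARK_HOME points at a valid directory,
# master and deployMode arrive empty -> NULL, nothing is downloaded.
sparkHome <- tempdir()
expect_null(sparkCheckInstall(sparkHome, master = "", deployMode = ""))

# spark-submit in yarn/mesos cluster mode: SPARK_HOME, master and deployMode
# are all empty on the driver host -> NULL, nothing is downloaded.
expect_null(sparkCheckInstall("", master = "", deployMode = ""))

# sparkR.session with deployMode = "client" but no valid SPARK_HOME:
# stop() with the remote installation instruction instead of downloading.
expect_error(sparkCheckInstall("", master = "yarn", deployMode = "client"))

# sparkR.session with master = "local[*]" and no valid SPARK_HOME would call
# install.spark() and return the new package directory (not run here because
# it triggers a real download).
```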





[GitHub] spark pull request #15888: [SPARK-18444][SPARKR] SparkR running in yarn-clus...

2016-11-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15888#discussion_r88127026
  
--- Diff: R/pkg/R/sparkR.R ---
@@ -558,16 +558,18 @@ sparkCheckInstall <- function(sparkHome, master) {
       message(msg)
       NULL
     } else {
-      if (!nzchar(master) || isMasterLocal(master)) {
+      if (isMasterLocal(master)) {
         msg <- paste0("Spark not found in SPARK_HOME: ",
                       sparkHome)
         message(msg)
         packageLocalDir <- install.spark()
         packageLocalDir
-      } else {
+      } else if (master == "yarn-client") {
         msg <- paste0("Spark not found in SPARK_HOME: ",
                       sparkHome, "\n", installInstruction("remote"))
         stop(msg)
+      } else {
--- End diff --

Also, what happens to "mesos"? That is what we are saying we should support in SparkR 2.1.0.
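
For reference, the ```isClientMode(master)``` check used in the later revision of this patch (see the 2016-11-20 diff above) matches any ```*-client``` master string, so ```mesos-client``` would be handled the same way as ```yarn-client```. A rough sketch of such a helper follows; the exact regex is an assumption and may differ from SparkR's internal implementation.

```
# Sketch of a client-mode check keyed off the master string; the regex here is
# assumed, not copied from SparkR's utils.
isClientMode <- function(master) {
  grepl("([a-z]+)-client$", master, perl = TRUE)
}

isClientMode("yarn-client")   # TRUE
isClientMode("mesos-client")  # TRUE
isClientMode("yarn")          # FALSE
isClientMode("local[*]")      # FALSE
```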





[GitHub] spark pull request #15888: [SPARK-18444][SPARKR] SparkR running in yarn-clus...

2016-11-15 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15888#discussion_r88032672
  
--- Diff: R/pkg/R/sparkR.R ---
@@ -558,16 +558,18 @@ sparkCheckInstall <- function(sparkHome, master) {
       message(msg)
       NULL
     } else {
-      if (!nzchar(master) || isMasterLocal(master)) {
--- End diff --

```master``` will always be empty if the job was submitted by ```spark-submit```.





[GitHub] spark pull request #15888: [SPARK-18444][SPARKR] SparkR running in yarn-clus...

2016-11-15 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15888#discussion_r88032786
  
--- Diff: R/pkg/R/sparkR.R ---
@@ -558,16 +558,18 @@ sparkCheckInstall <- function(sparkHome, master) {
       message(msg)
       NULL
     } else {
-      if (!nzchar(master) || isMasterLocal(master)) {
+      if (isMasterLocal(master)) {
         msg <- paste0("Spark not found in SPARK_HOME: ",
                       sparkHome)
         message(msg)
         packageLocalDir <- install.spark()
         packageLocalDir
-      } else {
+      } else if (master == "yarn-client") {
         msg <- paste0("Spark not found in SPARK_HOME: ",
                       sparkHome, "\n", installInstruction("remote"))
         stop(msg)
+      } else {
--- End diff --

In ```yarn-cluster``` mode, this branch will do nothing.





[GitHub] spark pull request #15888: [SPARK-18444][SPARKR] SparkR running in yarn-clus...

2016-11-15 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/15888

[SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not 
download Spark package.

## What changes were proposed in this pull request?
When running a SparkR job in yarn-cluster mode, it downloads the Spark package from the Apache website, which is not necessary.
```
./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R
```
The output is as follows:
```
Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

Spark not found in SPARK_HOME:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
..
```
There's no ```SPARK_HOME``` in yarn-cluster mode, since the R process runs on a remote host in the YARN cluster rather than on the client host. The JVM comes up first and the R process then connects to it, so in such cases we should never have to download Spark, as Spark is already running.

## How was this patch tested?
Offline test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-18444

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15888.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15888


commit 16aa40086f8e2e58f7e3d7c3ec95a2e4d5967e5b
Author: Yanbo Liang 
Date:   2016-11-15T10:01:38Z

SparkR running in yarn-cluster mode should not download Spark package.



