Repository: spark
Updated Branches:
  refs/heads/master cc67bd573 -> ef1622899


[SPARK-20989][CORE] Fail to start multiple workers on one host if external shuffle service is enabled in standalone mode

## What changes were proposed in this pull request?

In standalone mode, if we enable the external shuffle service by setting 
`spark.shuffle.service.enabled` to true, and then try to start multiple 
workers on one host (by setting `SPARK_WORKER_INSTANCES=3` in spark-env.sh and 
then running `sbin/start-slaves.sh`), only one worker launches successfully on 
each host and the rest fail to launch.
The reason is that the port of the external shuffle service is configured by 
`spark.shuffle.service.port`, so currently we can start no more than one 
external shuffle service on each host. In this case, each worker tries to start 
an external shuffle service, and only one of them succeeds.

We should give an explicit reason for the failure instead of failing silently.
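The port conflict can be illustrated with a minimal sketch in plain Scala (hypothetical demo code, not Spark's shuffle-service implementation; 7337 is the default value of `spark.shuffle.service.port`): the first bind of the port succeeds, and every subsequent bind on the same host fails.

```scala
import java.net.{BindException, ServerSocket}

object PortConflictSketch {
  def main(args: Array[String]): Unit = {
    val port = 7337 // default spark.shuffle.service.port
    // Worker 1's shuffle service: binding the port succeeds.
    val first = new ServerSocket(port)
    try {
      // Worker 2's shuffle service: same port, so this throws BindException.
      new ServerSocket(port)
    } catch {
      case _: BindException =>
        println(s"Port $port already in use; second shuffle service cannot start")
    } finally {
      first.close()
    }
  }
}
```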

## How was this patch tested?
Manually tested with the following steps:
1. Set `SPARK_WORKER_INSTANCES=3` in `conf/spark-env.sh`;
2. Set `spark.shuffle.service.enabled` to `true` in `conf/spark-defaults.conf`;
3. Run `sbin/start-all.sh`.
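The reproduction can be sketched as shell commands (paths assumed relative to the Spark distribution root; multiple worker instances per host are needed to trigger the conflict):

```shell
# Sketch of the reproduction setup (run from the Spark distribution root).
mkdir -p conf
echo 'export SPARK_WORKER_INSTANCES=3' >> conf/spark-env.sh
echo 'spark.shuffle.service.enabled true' >> conf/spark-defaults.conf
# sbin/start-all.sh   # then start the master and the workers
```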

Before the change, you see no errors on the command line:
```
starting org.apache.spark.deploy.master.Master, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.master.Master-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
```
And you can see in the web UI that only one worker is running.

After the change, you get explicit error messages on the command line:
```
starting org.apache.spark.deploy.master.Master, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.master.Master-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://xxx.local:7077
localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing view acls to: xxx
localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing modify acls to: xxx
localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing view acls groups to:
localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing modify acls groups to:
localhost:   17/06/13 23:24:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
localhost:   17/06/13 23:24:54 INFO Utils: Successfully started service 'sparkWorker' on port 63354.
localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
localhost:      at scala.Predef$.require(Predef.scala:224)
localhost:      at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
localhost:      at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8082 spark://xxx.local:7077
localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing view acls to: xxx
localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing modify acls to: xxx
localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing view acls groups to:
localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing modify acls groups to:
localhost:   17/06/13 23:24:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
localhost:   17/06/13 23:24:56 INFO Utils: Successfully started service 'sparkWorker' on port 63359.
localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
localhost:      at scala.Predef$.require(Predef.scala:224)
localhost:      at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
localhost:      at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8083 spark://xxx.local:7077
localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing view acls to: xxx
localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing modify acls to: xxx
localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing view acls groups to:
localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing modify acls groups to:
localhost:   17/06/13 23:24:59 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
localhost:   17/06/13 23:24:59 INFO Utils: Successfully started service 'sparkWorker' on port 63360.
localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
localhost:      at scala.Predef$.require(Predef.scala:224)
localhost:      at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
localhost:      at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
```

Author: Xingbo Jiang <xingbo.ji...@databricks.com>

Closes #18290 from jiangxb1987/start-slave.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ef162289
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ef162289
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ef162289

Branch: refs/heads/master
Commit: ef1622899ffc6ab136102ffc6bcc714402e6f334
Parents: cc67bd5
Author: Xingbo Jiang <xingbo.ji...@databricks.com>
Authored: Tue Jun 20 17:17:21 2017 +0800
Committer: Wenchen Fan <wenc...@databricks.com>
Committed: Tue Jun 20 17:17:21 2017 +0800

----------------------------------------------------------------------
 .../scala/org/apache/spark/deploy/worker/Worker.scala    | 11 +++++++++++
 sbin/spark-daemon.sh                                     |  2 +-
 2 files changed, 12 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/ef162289/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
index 1198e3c..bed4745 100755
--- a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
@@ -742,6 +742,17 @@ private[deploy] object Worker extends Logging {
     val args = new WorkerArguments(argStrings, conf)
     val rpcEnv = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, args.cores,
       args.memory, args.masters, args.workDir, conf = conf)
+    // With external shuffle service enabled, if we request to launch multiple workers on one host,
+    // we can only successfully launch the first worker and the rest fails, because with the port
+    // bound, we may launch no more than one external shuffle service on each host.
+    // When this happens, we should give explicit reason of failure instead of fail silently. For
+    // more detail see SPARK-20989.
+    val externalShuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
+    val sparkWorkerInstances = scala.sys.env.getOrElse("SPARK_WORKER_INSTANCES", "1").toInt
+    require(externalShuffleServiceEnabled == false || sparkWorkerInstances <= 1,
+      "Starting multiple workers on one host is failed because we may launch no more than one " +
+        "external shuffle service on each host, please set spark.shuffle.service.enabled to " +
+        "false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.")
     rpcEnv.awaitTermination()
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/ef162289/sbin/spark-daemon.sh
----------------------------------------------------------------------
diff --git a/sbin/spark-daemon.sh b/sbin/spark-daemon.sh
index c227c98..6de67e0 100755
--- a/sbin/spark-daemon.sh
+++ b/sbin/spark-daemon.sh
@@ -143,7 +143,7 @@ execute_command() {
      # Check if the process has died; in that case we'll tail the log so the user can see
       if [[ ! $(ps -p "$newpid" -o comm=) =~ "java" ]]; then
         echo "failed to launch: $@"
-        tail -2 "$log" | sed 's/^/  /'
+        tail -10 "$log" | sed 's/^/  /'
         echo "full log in $log"
       fi
   else

