[ https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003905#comment-15003905 ]
Radoslaw Gruchalski commented on SPARK-11638:
---------------------------------------------

Exactly, the only "problematic" thing is how to get the IPs into the container. When submitting a task to Mesos/Marathon, you submit the task to the Mesos master, so at the time of submission you don't know where the task is going to run. This is what we do at Virdata (pseudo code):

- have a file called {{/etc/agent.sh}} on every agent; this file contains something like:

{noformat}
#!/bin/bash
AGENT_PRIVATE_IP=$(ifconfig ...)
{noformat}

- when we submit the task to Marathon (we use Marathon), we mount that file into the container:

{noformat}
{
  ...
  "container": {
    "type": "DOCKER",
    "docker": ...,
    "volumes": [
      {
        "containerPath": "/etc/agent.sh",
        "hostPath": "/etc/agent.sh",
        "mode": "RO"
      }
    ]
  }
}
{noformat}

- in the container, {{source /etc/agent.sh}}.

In case of the executors having to know the addresses of every agent (so they can resolve back to the master), the simplest way would be to generate a file like this:

{noformat}
# /etc/mesos-hosts
10.100.1.10 mesos-agent1
10.100.1.11 mesos-agent2
...
{noformat}

and store it on HDFS. As long as the executor container can read from HDFS, you'll be sorted.

Again, I think an MVE would be much clearer than this write-up. Happy to provide such code, but it may be difficult today.
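For illustration, a rough sketch of the two halves of that approach. Generating the file on the host keeps the value the agent's address rather than the container's; the interface name ({{eth0}}) and the file path are assumptions, not something Mesos or this ticket prescribes:

{noformat}
#!/bin/bash
# On the agent host, e.g. once at provisioning time: capture the agent's
# private IP so the file can be bind-mounted read-only into containers.
AGENT_PRIVATE_IP=$(ip -4 addr show eth0 | awk '/inet /{print $2}' | cut -d/ -f1)
echo "AGENT_PRIVATE_IP=${AGENT_PRIVATE_IP}" > /etc/agent.sh
{noformat}

and in the container entrypoint:

{noformat}
#!/bin/bash
# Inside the container: pick the value up and advertise the agent's address
# instead of the container's bridge address.
source /etc/agent.sh
export LIBPROCESS_ADVERTISE_IP="${AGENT_PRIVATE_IP}"
{noformat}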
> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge networking
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11638
>                 URL: https://issues.apache.org/jira/browse/SPARK-11638
>             Project: Spark
>          Issue Type: Improvement
>          Components: Mesos, Spark Core
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>            Reporter: Radoslaw Gruchalski
>         Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and {{spark.replClassServer.advertisedPort}} settings to enable running Spark on Mesos, in Docker with Bridge networking. Provides patches for Akka Remote to enable Spark driver advertisement using an alternative host and port.
> With these settings, it is possible to run the Spark Master in a Docker container and have the executors running on Mesos talk back correctly to such a Master.
> The problem is discussed on the Mesos mailing list here:
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework in the bridged container to receive offers, Mesos in the container has to register for offers using the IP address of the Agent. Offers are sent by the Mesos Master to the Docker container running on a different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} would advertise itself using the IP address of the container, something like {{172.x.x.x}}. Obviously, the Mesos Master can't reach that address: it's a different host, a different machine. Mesos 0.24.0 introduced two new properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and {{LIBPROCESS_ADVERTISE_PORT}}. These allow the container to use the Agent's address to register for offers. This was provided mainly for running Mesos in Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box Spark does not allow advertising its services on ports different from the ones it binds to. Consider the following scenario: Spark is running inside a Docker container on Mesos, in bridge networking mode. Assume port {{6666}} for {{spark.driver.port}}, {{6677}} for {{spark.fileserver.port}}, {{6688}} for {{spark.broadcast.port}} and {{23456}} for {{spark.replClassServer.port}}. If such a task is posted to Marathon, Mesos will assign 4 ports in the {{31000-32000}} range, mapped to the container ports. Starting the executors from such a container results in the executors not being able to communicate back to the Spark Master.
> This happens because of two things:
> The Spark driver is effectively an {{akka-remote}} system with the {{akka.tcp}} transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port different from the one it is bound to. The relevant settings are discussed here:
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
> They do not exist in Akka {{2.3.x}}. The Spark driver will always advertise port {{6666}}, as this is the one {{akka-remote}} is bound to.
> Any URIs the executors contact the Spark Master on are prepared by the Spark Master and handed over to the executors. These always contain the port number that the Master itself uses for the service. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> All of the above ports are {{0}} by default (random assignment) but can be specified using Spark configuration ({{-Dspark...port}}). However, they are limited in the same way as {{spark.driver.port}}: in the above example, an executor should not contact the file server on port {{6677}} but rather on the respective 31xxx port assigned by Mesos.
> Spark currently does not allow any of that.
> h4. Taking on the problem, step 1: Spark Driver
> As mentioned above, the Spark Driver is based on {{akka-remote}}. In order to take on the problem, the {{akka.remote.netty.tcp.bind-hostname}} and {{akka.remote.netty.tcp.bind-port}} settings are a must. Spark does not compile with Akka 2.4.x yet.
> What we want is a backport of the mentioned {{akka-remote}} settings to the {{2.3.x}} versions. These patches are attached to this ticket - the {{2.3.4.patch}} and {{2.3.11.patch}} files provide patches for the respective Akka versions. They add the mentioned settings and ensure they work as documented for Akka 2.4. In other words, they are forward compatible.
> A part of that patch also exists in the patch for Spark, in the {{org.apache.spark.util.AkkaUtils}} class. This is where Spark creates the driver and assembles the Akka configuration. That part of the patch tells Akka to use {{bind-hostname}} instead of {{hostname}} if {{spark.driver.advertisedHost}} is given, and to use {{bind-port}} instead of {{port}} if {{spark.driver.advertisedPort}} is given. In such cases, {{hostname}} and {{port}} are set to the advertised values, respectively.
> *Worth mentioning:* if {{spark.driver.advertisedHost}} or {{spark.driver.advertisedPort}} isn't given, patched Spark reverts to using the settings as they would be with a non-patched {{akka-remote}} - exactly for the case where no patched {{akka-remote}} is in use. Even if one is in use, {{akka-remote}} will correctly handle undefined {{bind-hostname}} and {{bind-port}}, as specified for Akka 2.4.x.
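For illustration only, a sketch of how a container entrypoint might wire the proposed settings together once the patches are applied. The port numbers follow the scenario above; the use of Marathon's {{$PORT0}}..{{$PORT3}} environment variables and of {{/etc/agent.sh}} from the comment at the top are assumptions, and the three service ports are described under step 2 below:

{noformat}
#!/bin/bash
# Inside the bridged container: the fixed container ports from the scenario
# above (6666/6677/6688/23456) are mapped by Mesos/Marathon to host ports in
# the 31000-32000 range, which Marathon exposes as $PORT0..$PORT3.
source /etc/agent.sh   # provides AGENT_PRIVATE_IP, see the comment above

exec spark-submit \
  --conf spark.driver.port=6666 \
  --conf spark.driver.advertisedHost="${AGENT_PRIVATE_IP}" \
  --conf spark.driver.advertisedPort="${PORT0}" \
  --conf spark.fileserver.port=6677 \
  --conf spark.fileserver.advertisedPort="${PORT1}" \
  --conf spark.broadcast.port=6688 \
  --conf spark.broadcast.advertisedPort="${PORT2}" \
  --conf spark.replClassServer.port=23456 \
  --conf spark.replClassServer.advertisedPort="${PORT3}" \
  "$@"   # remaining arguments: --master mesos://..., --class, application jar
{noformat}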
> h5. Akka versions in Spark (attached patches only)
> - Akka 2.3.4
>   - Spark 1.4.0
>   - Spark 1.4.1
> - Akka 2.3.11
>   - Spark 1.5.0
>   - Spark 1.5.1
>   - Spark 1.6.0-SNAPSHOT
> h4. Taking on the problem, step 2: Spark services
> The fortunate thing is that every other Spark service runs over HTTP, using the {{org.apache.spark.HttpServer}} class. This is where the second part of the Spark patch comes into play. All other changes in the patch files provide alternative {{advertised...}} ports for each of the following services:
> - {{spark.broadcast.port}} -> {{spark.broadcast.advertisedPort}}
> - {{spark.fileserver.port}} -> {{spark.fileserver.advertisedPort}}
> - {{spark.replClassServer.port}} -> {{spark.replClassServer.advertisedPort}}
> What we are telling Spark here is the following: if an alternative {{advertisedPort}} setting is given for this server instance, use that setting when advertising the port.
> h4. Patches
> These patches have been cleared by the Technicolor IP&L Team to be contributed back to Spark under the Apache 2.0 License.
> All patches for versions from {{1.4.0}} to {{1.5.2}} can be applied directly to the respective tag in the Spark git repository. The {{1.6.0-master.patch}} applies to git sha {{18350a57004eb87cafa9504ff73affab4b818e06}}.
> h4. Building Akka
> To build the required Akka version:
> {noformat}
> AKKA_VERSION=2.3.4
> git clone https://github.com/akka/akka.git .
> git fetch origin
> git checkout v${AKKA_VERSION}
> git apply ...2.3.4.patch
> sbt package -Dakka.scaladoc.diagrams=false
> {noformat}
> h4. What is not supplied
> At the moment of contribution, we do not supply any unit tests. We would like to contribute those but we may require some assistance.
> =====
> Happy to answer any questions and looking forward to any guidance which would lead to having these included in the master Spark version.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org