[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003905#comment-15003905
 ] 

Radoslaw Gruchalski commented on SPARK-11638:
---------------------------------------------

Exactly, the only "problematic" thing is how to get the ips into the container. 
When submitting a task to mesos/marathon, you submit the task to the mesos 
master, so at the time of submission you don't know where the task is going to 
run. When submitting a task to Marathon, this is what we do at Virdata (pseudo 
code):

- have a file called /etc/agent.sh, this file contains something like:

{noformat}
#!/bin/bash
AGENT_PRIVATE_IP=$(ifconfig ...)
{noformat}
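For completeness, one way of filling that variable in (a sketch only; the
interface name {{eth0}} is an assumption and depends on the agent host):

{noformat}
#!/bin/bash
# pick the agent's private IPv4 address from the assumed main interface (eth0)
AGENT_PRIVATE_IP=$(ip -4 -o addr show eth0 | awk '{print $4}' | cut -d/ -f1)
export AGENT_PRIVATE_IP
{noformat}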

When we submit the task to Marathon, we mount that file into the container:

{noformat}
{
  ...
  "container": {
    "type": "DOCKER",
    "docker": ...,
    "volumes": [
      {
        "containerPath": "/etc/agent.sh",
        "hostPath": "/etc/agent.sh",
        "mode": "RO"
      }
    ]
  }
}
{noformat}

In the container, {{source /etc/agent.sh}}.
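Once sourced, the value can be handed over to whatever starts the driver. A
sketch only, assuming the {{spark.driver.advertisedHost}}/{{advertisedPort}}
settings proposed in the ticket below and Marathon's {{PORT0}} variable for the
first allocated host port:

{noformat}
source /etc/agent.sh
# advertise the agent's address and the Marathon-assigned host port to the executors
spark-submit \
  --conf spark.driver.advertisedHost=${AGENT_PRIVATE_IP} \
  --conf spark.driver.advertisedPort=${PORT0} \
  ...
{noformat}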

If the executors have to know the addresses of every agent (so they can resolve 
back to the master), the simplest way would be to generate a file like this:

{noformat}
# /etc/mesos-hosts
10.100.1.10    mesos-agent1
10.100.1.11    mesos-agent2
...
{noformat}

And store it on HDFS. As long as the executor container can read from HDFS, 
you'll be sorted. Again, I think an MVE would be much clearer than this 
write-up. Happy to provide such code, but it may be difficult today.
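
A rough sketch of that flow (the HDFS path is illustrative):

{noformat}
# on a node that knows the agents: publish the file
hdfs dfs -mkdir -p /config
hdfs dfs -put -f /etc/mesos-hosts /config/mesos-hosts

# inside the executor container, before starting work: pull it in
hdfs dfs -cat /config/mesos-hosts >> /etc/hosts
{noformat}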

> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker 
> with Bridge networking
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11638
>                 URL: https://issues.apache.org/jira/browse/SPARK-11638
>             Project: Spark
>          Issue Type: Improvement
>          Components: Mesos, Spark Core
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>            Reporter: Radoslaw Gruchalski
>         Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, {{spark.fileserver.advertisedPort}}, 
> {{spark.broadcast.advertisedPort}} and {{spark.replClassServer.advertisedPort}} 
> settings to enable running Spark on Mesos, in Docker with bridge networking. 
> Provides patches for Akka Remote to enable advertising the Spark driver on an 
> alternative host and port.
> With these settings, it is possible to run the Spark Master in a Docker 
> container and have the executors running on Mesos talk back correctly to such 
> a Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive offers in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by the Mesos Master to the Docker container running on 
> a different host, an Agent. Prior to Mesos 0.24.0, {{libprocess}} would 
> advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, the Mesos Master can't reach that address; it 
> belongs to a different machine. Mesos 0.24.0 introduced two new properties 
> for {{libprocess}}: {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. These allow the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
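> As an illustration (the values are examples only), a framework in a bridged 
> container would export something like:
> {noformat}
> # advertise the Agent's address and the mapped host port instead of 172.x.x.x
> export LIBPROCESS_ADVERTISE_IP=10.100.1.10
> export LIBPROCESS_ADVERTISE_PORT=31900
> {noformat}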
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box, Spark does not allow advertising its 
> services on ports different from the bind ports. Consider the following 
> scenario:
> Spark is running inside a Docker container on Mesos, in bridge networking 
> mode. Assume port {{6666}} for {{spark.driver.port}}, {{6677}} for 
> {{spark.fileserver.port}}, {{6688}} for {{spark.broadcast.port}} and 
> {{23456}} for {{spark.replClassServer.port}}. If such a task is posted to 
> Marathon, Mesos will assign 4 ports in the {{31000-32000}} range, mapped to 
> the container ports. Starting the executors from such a container results in 
> the executors not being able to communicate back to the Spark Master.
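> To make the scenario concrete, the Marathon app definition would carry port 
> mappings roughly like this (illustrative; Marathon/Mesos pick the host ports):
> {noformat}
> "portMappings": [
>   { "containerPort": 6666,  "hostPort": 0, "protocol": "tcp" },
>   { "containerPort": 6677,  "hostPort": 0, "protocol": "tcp" },
>   { "containerPort": 6688,  "hostPort": 0, "protocol": "tcp" },
>   { "containerPort": 23456, "hostPort": 0, "protocol": "tcp" }
> ]
> {noformat}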
> This happens because of 2 things:
> The Spark driver is effectively an {{akka-remote}} system with the 
> {{akka.tcp}} transport. {{akka-remote}} prior to version {{2.4}} can't 
> advertise a port different from the one it is bound to. The settings in 
> question are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
> These do not exist in Akka {{2.3.x}}. The Spark driver will always advertise 
> port {{6666}}, as this is the one {{akka-remote}} is bound to.
> Any URIs the executors contact the Spark Master on are prepared by the Spark 
> Master and handed over to the executors. These always contain the port 
> number the Master itself uses for the service. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> All of the above ports default to {{0}} (random assignment) but can be 
> specified via Spark configuration ( {{-Dspark...port}} ). However, they are 
> limited in the same way as {{spark.driver.port}}: in the above example, an 
> executor should not contact the file server on port {{6677}} but rather on 
> the respective 31xxx port assigned by Mesos.
> Spark currently does not allow any of that.
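> For reference, pinning the in-container ports today looks roughly like this 
> (illustrative values):
> {noformat}
> spark-submit \
>   --conf spark.driver.port=6666 \
>   --conf spark.fileserver.port=6677 \
>   --conf spark.broadcast.port=6688 \
>   --conf spark.replClassServer.port=23456 \
>   ...
> {noformat}
> The executors, however, are still handed these in-container port numbers 
> rather than the 31xxx host ports.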
> h4. Taking on the problem, step 1: Spark Driver
> As mentioned above, the Spark Driver is based on {{akka-remote}}. In order to 
> take on the problem, the {{akka.remote.netty.tcp.bind-hostname}} and 
> {{akka.remote.netty.tcp.bind-port}} settings are a must. Spark does not 
> compile with Akka 2.4.x yet.
> What we want is a backport of the mentioned {{akka-remote}} settings to the 
> {{2.3.x}} versions. These patches are attached to this ticket: the 
> {{2.3.4.patch}} and {{2.3.11.patch}} files provide patches for the respective 
> Akka versions. They add the mentioned settings and ensure they work as 
> documented for Akka 2.4; in other words, they are forward compatible.
> A counterpart of that patch also exists in the patch for Spark, in the 
> {{org.apache.spark.util.AkkaUtils}} class. This is where Spark creates the 
> driver and assembles the Akka configuration. That part of the patch tells 
> Akka to bind on {{bind-hostname}} instead of {{hostname}} if 
> {{spark.driver.advertisedHost}} is given, and on {{bind-port}} instead of 
> {{port}} if {{spark.driver.advertisedPort}} is given. In such cases, 
> {{hostname}} and {{port}} are set to the advertised values, respectively.
> *Worth mentioning:* if {{spark.driver.advertisedHost}} or 
> {{spark.driver.advertisedPort}} isn't given, the patched Spark falls back to 
> the settings as they would be with an unpatched {{akka-remote}}, precisely so 
> that it still works when no patched {{akka-remote}} is in use. Even if one is 
> in use, {{akka-remote}} will correctly handle undefined {{bind-hostname}} and 
> {{bind-port}}, as specified for Akka 2.4.x.
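> To illustrate the effect (addresses and ports are made up), the effective 
> Akka Remote configuration assembled by the patched {{AkkaUtils}} would look 
> roughly like:
> {noformat}
> akka.remote.netty.tcp {
>   hostname      = "10.100.1.10"  # spark.driver.advertisedHost - the Agent's address
>   port          = 31000          # spark.driver.advertisedPort - the mapped host port
>   bind-hostname = "172.17.0.2"   # the container's own address
>   bind-port     = 6666           # spark.driver.port inside the container
> }
> {noformat}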
> h5. Akka versions in Spark (attached patches only)
> - Akka 2.3.4
>  - Spark 1.4.0
>  - Spark 1.4.1
> - Akka 2.3.11
>  - Spark 1.5.0
>  - Spark 1.5.1
>  - Spark 1.6.0-SNAPSHOT
> h4. Taking on the problem, step 2: Spark services
> The fortunate thing is that every other Spark service runs over HTTP, using 
> the {{org.apache.spark.HttpServer}} class. This is where the second part of 
> the Spark patch comes into play. All other changes in the patch files provide 
> alternative {{advertised...}} ports for each of the following services:
> - {{spark.broadcast.port}} -> {{spark.broadcast.advertisedPort}}
> - {{spark.fileserver.port}} -> {{spark.fileserver.advertisedPort}}
> - {{spark.replClassServer.port}} -> {{spark.replClassServer.advertisedPort}}
> What we are telling Spark here is the following: if an alternative 
> {{advertisedPort}} setting is given for this server instance, use that 
> setting when advertising the port.
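> Putting it together, a driver in a bridged container could then be started 
> roughly like this (all values illustrative; the 31xxx ports are the host 
> ports handed out by Mesos/Marathon):
> {noformat}
> spark-submit \
>   --conf spark.driver.port=6666 --conf spark.driver.advertisedPort=31000 \
>   --conf spark.fileserver.port=6677 --conf spark.fileserver.advertisedPort=31001 \
>   --conf spark.broadcast.port=6688 --conf spark.broadcast.advertisedPort=31002 \
>   --conf spark.replClassServer.port=23456 --conf spark.replClassServer.advertisedPort=31003 \
>   --conf spark.driver.advertisedHost=${AGENT_PRIVATE_IP} \
>   ...
> {noformat}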
> h4. Patches
> These patches have been cleared by the Technicolor IP&L Team to be 
> contributed back to Spark under the Apache 2.0 License.
> All patches for versions {{1.4.0}} through {{1.5.2}} can be applied directly 
> to the respective tag in the Spark git repository. The {{1.6.0-master.patch}} 
> applies to git sha {{18350a57004eb87cafa9504ff73affab4b818e06}}.
> h4. Building Akka
> To build the required Akka version:
> {noformat}
> AKKA_VERSION=2.3.4
> git clone https://github.com/akka/akka.git .
> git fetch origin
> git checkout v${AKKA_VERSION}
> git apply ...2.3.4.patch
> sbt package -Dakka.scaladoc.diagrams=false
> {noformat}
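> To actually pick the patched build up from Spark, the artifacts also need to 
> be made visible to the Spark build, e.g. via sbt's standard {{publishLocal}} 
> task (a sketch, not part of the attached patches):
> {noformat}
> sbt publishLocal -Dakka.scaladoc.diagrams=false
> {noformat}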
> h4. What is not supplied
> At the moment of contribution, we do not supply any unit tests. We would like 
> to contribute those, but we may require some assistance.
> =====
> Happy to answer any questions, and looking forward to any guidance that would 
> lead to having these included in the master Spark version.


