Hi folks,

I was in the process of cleaning up some tech debt related to env variables in our code base. I created an epic ticket <https://issues.apache.org/jira/browse/MESOS-6341> to track it. I searched for relevant tickets filed previously, and found MESOS-3740 <https://issues.apache.org/jira/browse/MESOS-3740>. I did some digging on how we handle LIBPROCESS_IP currently, and here are my findings:

1) We always set LIBPROCESS_IP in the executor environment variables:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L6793-L6796
This is not an issue for an executor that runs on the host network. However, if the executor wants to run on a non-host network (e.g., overlay), this might be problematic, because libprocess for the executor will try to bind to LIBPROCESS_IP, but that IP is not valid inside the container.

2) As mentioned in MESOS-3740 <https://issues.apache.org/jira/browse/MESOS-3740>, some users want to run a Mesos framework in a Mesos container. The old-style framework driver assumes a two-way communication channel between the framework and the Mesos master. In order for the master to reach the framework running inside a Mesos container, the framework's libprocess should advertise its IP and port properly. This gets tricky because the right answer depends on the networking of the Mesos container:

2.a) If the container uses the host network, libprocess should bind to 0.0.0.0, and advertise itself using the agent IP and the relevant port.

2.b) If the container has a routable IP (e.g., using calico or overlay), libprocess should still bind to 0.0.0.0, and advertise itself using the container IP and the relevant port. Currently, it binds to the agent IP (which will fail), and advertises itself using the agent IP and the port in the container (which will fail as well).

2.c) If the container has a private IP (e.g., bridge), libprocess should still bind to 0.0.0.0, and advertise itself using the agent IP and the _mapped_ host port. Currently, it binds to the agent IP (which will fail), and advertises itself using the agent IP and the port in the container (which will fail as well).

Therefore, the workaround <https://github.com/mesosphere/mesos/commit/b9c622b53b3ffcc27911fcdcefc37a52ebe33bdd> suggested in MESOS-3740 <https://issues.apache.org/jira/browse/MESOS-3740> is not ideal.
It does not consider 2.b) and 2.c).

Libprocess now supports both LIBPROCESS_IP and LIBPROCESS_ADVERTISE_IP, so the bind address does not have to be the address that is advertised. For the 2.c) case, Mesos doesn't have a way to determine the advertise port (the mapped port). This information is known only to the framework (it chooses which host port will serve as the mapped port for libprocess).

Given that, I think Mesos should not blindly set LIBPROCESS_IP to the agent IP in the executor environment variables. The framework should be the one that sets LIBPROCESS_ADVERTISE_IP and LIBPROCESS_ADVERTISE_PORT appropriately if it tries to launch another Mesos framework, so that the master can reach the new framework. If the framework just wants to launch a regular container that does not depend on libprocess, it should simply not set these env variables.

Also, I think libprocess should always bind to 0.0.0.0, rather than doing a hostname lookup and binding to the IP found for the hostname. LIBPROCESS_ADVERTISE_IP can be used to override the IP address it advertises to peers. If that's not specified, libprocess will do a hostname lookup to guess a routable IP.

Thoughts?

- Jie
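
For illustration only, the three cases in 2) could be sketched as a small Python helper that a framework might use to pick the environment for a nested framework. The function name, the network labels ("host", "routable", "bridge") and the sample addresses are hypothetical; only the three LIBPROCESS_* variable names come from the discussion above. The sketch also assumes libprocess accepts 0.0.0.0 as LIBPROCESS_IP, which is the behaviour proposed here and not something verified against the current code.

```python
from typing import Dict, Optional


def libprocess_env(
    network: str,
    agent_ip: str,
    container_ip: Optional[str] = None,
    mapped_host_port: Optional[int] = None,
) -> Dict[str, str]:
    """Hypothetical helper: env vars for a framework launched in a container.

    `network` is an illustrative label, not a Mesos API value:
      "host"     -> case 2.a (container shares the host network)
      "routable" -> case 2.b (container has a routable IP, e.g. overlay)
      "bridge"   -> case 2.c (private IP, host port mapped to the container)
    """
    # Assumption: binding to 0.0.0.0 is acceptable in all three cases.
    env = {"LIBPROCESS_IP": "0.0.0.0"}

    if network == "host":
        # 2.a: peers reach the framework at the agent IP.
        env["LIBPROCESS_ADVERTISE_IP"] = agent_ip
    elif network == "routable":
        # 2.b: peers reach the framework at the container's own IP.
        if container_ip is None:
            raise ValueError("container_ip is required for a routable network")
        env["LIBPROCESS_ADVERTISE_IP"] = container_ip
    elif network == "bridge":
        # 2.c: peers reach the framework at the agent IP and the mapped
        # host port, which only the launching framework knows.
        if mapped_host_port is None:
            raise ValueError("mapped_host_port is required for a bridge network")
        env["LIBPROCESS_ADVERTISE_IP"] = agent_ip
        env["LIBPROCESS_ADVERTISE_PORT"] = str(mapped_host_port)
    else:
        raise ValueError(f"unknown network label: {network!r}")

    return env
```

For example, `libprocess_env("bridge", "10.0.0.5", mapped_host_port=31005)` would advertise the agent IP together with the mapped host port, while a regular container that does not use libprocess would simply not get any of these variables.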
