I have been researching service discovery on Mesos quite a bit lately, and
due to my background, I may be making assumptions that don't apply to a
Mesos datacenter. After reading through the docs, I have come up with two
main approaches to service discovery, and both appear to have strengths and
weaknesses. I wanted to describe what I've seen here, along with the
challenges as I understand them, in the hope that any misconceptions I may
have can be corrected.

Basically, I see two main approaches to service discovery on Mesos. You
have the mesos-dns package (https://github.com/mesosphere/mesos-dns), which
is DNS-based service discovery, and then you have HAProxy-based discovery
(represented by both the haproxy-marathon-bridge script (
https://github.com/mesosphere/marathon/blob/master/bin/haproxy-marathon-bridge)
and the Bamboo project (https://github.com/QubitProducts/bamboo)).

HAProxy

With the HAProxy method, as I see it, you basically install HAProxy on
every node. The two projects mentioned above query Marathon to determine
where the services are running, and then rewrite the HAProxy config on
every node so that every node listens on a specific port. From there, that
port is load balanced to the actual node/port combinations where the
service instances are running.
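To make that concrete, here is a rough sketch (in Python, not what either
project actually ships) of the kind of loop such a bridge runs. The
Marathon address is a placeholder, and the JSON field names (tasks, appId,
host, ports, servicePorts) are my reading of Marathon's /v2/tasks endpoint,
so treat them as assumptions:

import json
import urllib.request

# Hypothetical Marathon endpoint; substitute your own.
MARATHON = "http://marathon.example.com:8080"

req = urllib.request.Request(MARATHON + "/v2/tasks",
                             headers={"Accept": "application/json"})
with urllib.request.urlopen(req) as resp:
    tasks = json.loads(resp.read().decode("utf-8"))["tasks"]

# Group tasks by (app, service port); each task carries the host it landed
# on and the host port that was actually bound for it.
backends = {}
for t in tasks:
    for service_port, host_port in zip(t.get("servicePorts", []),
                                       t.get("ports", [])):
        backends.setdefault((t["appId"], service_port), []).append(
            (t["host"], host_port))

# Emit one HAProxy "listen" stanza per service port, like the example
# further down in this email.
for (app_id, service_port), servers in sorted(backends.items()):
    name = app_id.strip("/")
    print(f"listen {name}-{service_port}")
    print(f"  bind 0.0.0.0:{service_port}")
    print("  mode tcp")
    print("  balance leastconn")
    for i, (host, port) in enumerate(servers, 1):
        print(f"  server {name}-{i} {host}:{port} check")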

So, let's use the example of a Hive Thrift server running in a Docker
container on port 10000.  Let's say you have a 5 node cluster: node1,
node2, etc. You spin that container up with instances = 3 in Marathon, and
Marathon/Docker run one container on node3 and two on node2.  Port 10000
inside each container is bridged to an available port on the physical
node. Perhaps one instance on node2 gets 30000 and the other instance gets
30001, and node3's instance gets 30000.  So now you have 3 instances
exposed at

node2:30000 -> dockercontainer:10000
node2:30001 -> dockercontainer:10000
node3:30000 -> dockercontainer:10000

With the HAProxy setup, each node would get this in its local HAProxy
config:

listen hivethrift-10000
  bind 0.0.0.0:10000
  mode tcp
  option tcplog
  balance leastconn
  server hivethrift-3 node2:30000 check
  server hivethrift-2 node2:30001 check
  server hivethrift-1 node3:30000 check

This would allow you to connect to any node in your cluster on port 10000
and be served by one of the three containers running your Hive Thrift
server (e.g., jdbc:hive2://node1:10000/default would work even though no
container is running on node1).

Pretty neat? However, there are some challenges here:

1. You now have a total of 65536 ports for your entire datacenter. This
method is port-only: your whole cluster listens on a port, and that port is
dedicated to one service.  This actually makes sense in some ways, because
if you think of Mesos as a cluster operating system, the limitations of
TCP/UDP are such that each kernel only has that many ports.  There isn't a
cluster TCP or UDP, just TCP and UDP.  That is still a lot of ports, but
you do need to be aware of the limitation and manage your ports, especially
since that number isn't really the total number of available ports: some of
those 65536 are reserved for cluster operations and/or things like HDFS.

2. You are now essentially adding a hop to your traffic that could affect
sensitive applications.  At least with the haproxy-marathon-bridge script,
the settings for each application are static, baked into the script (an
improvement here would be to allow timeout settings and other HAProxy
options to be set per application and managed somewhere; I think that may
be what Bamboo offers, I just haven't dug in yet, but see the sketch after
this list for what I mean).  The glaring issue I found was specifically
with the Hive Thrift service.  You connect, you run some queries, all is
well. However, if you submit a long query (longer than the default 50000 ms
timeout), there may not be any packets actually transferred in that time.
The client is OK with this, the server is OK with this, but HAProxy sees no
packets within its timeout period, decides the connection is dead, closes
it, and then you get problems.  I would imagine Thrift isn't the only
service where situations like this occur.  I need to do more research on
how to get around this; there may be some hope in Hive 1.1.0 with Thrift
keep-alives, but not every application service will have that option in the
pipeline.
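To illustrate the per-application-settings idea from point 2, here is a
rough sketch of how a bridge-style generator could merge per-app HAProxy
overrides (say, much longer timeouts for Thrift) into the stanza it writes.
The override table, its values, and keying by Marathon appId are
hypothetical, not something either project does today; "timeout client" and
"timeout server" are the HAProxy directives that control the idle timeout
in question:

# Sketch: render one HAProxy "listen" stanza, applying per-app overrides.
# APP_OVERRIDES is hypothetical; neither bridge script works this way today.
DEFAULT_OPTIONS = ["mode tcp", "option tcplog", "balance leastconn"]

# Hypothetical per-application overrides, keyed by Marathon appId.
APP_OVERRIDES = {
    "/hivethrift": ["timeout client 1h", "timeout server 1h"],
}

def render_listen(app_id, service_port, servers):
    """Return the HAProxy stanza for one application as a string."""
    name = app_id.strip("/")
    lines = [f"listen {name}-{service_port}",
             f"  bind 0.0.0.0:{service_port}"]
    for option in DEFAULT_OPTIONS + APP_OVERRIDES.get(app_id, []):
        lines.append(f"  {option}")
    for i, (host, port) in enumerate(servers, 1):
        lines.append(f"  server {name}-{i} {host}:{port} check")
    return "\n".join(lines)

print(render_listen("/hivethrift", 10000,
                    [("node2", 30000), ("node2", 30001), ("node3", 30000)]))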

Mesos-DNS

This project came to my attention this week, and I am looking to get it
installed today to get some hands-on time with it.  Basically, it's a
binary that queries the Mesos master and generates A records, with
hostnames based on the framework names, and SRV records based on the
assigned ports.

This is where I get confused. I can see the A records being useful, but
your entire network would have to be able to use mesos-dns (including
non-Mesos systems); otherwise, how would a client know how to resolve a
.mesos domain name? Perhaps there should be a way to integrate mesos-dns as
the authoritative zone for .mesos in your standard enterprise DNS servers.
That would also avoid the configuration headache of having to touch the DNS
settings on every node.  I need to research DNS a bit more, but couldn't
you set up, say in BIND, that any requests for .mesos are forwarded to the
mesos-dns service, with the answer sent back through your standard DNS to
the client?  Wouldn't this be preferable to setting mesos-dns as the first
DNS server on each host and having THAT forward everything else off to your
standard enterprise DNS servers?

Another issue I see with DNS is that it works well for hostnames, but what
about ports? Yes, I see there are SRV records that will return the ports,
but how would that even be used?  Consider the Hive Thrift example above.
We could assume Hive Thrift runs on port 10000 on all nodes in the cluster
and hard-code that port, but then you run into the same port-management
issues as HAProxy. You can't really tell a JDBC connection URL to look up
its port via DNS, can you?  How do you get applications that expect to
connect to a fixed integer port to do a DNS lookup to resolve that port? Or
are we back to having one cluster and 65536 ports for all the services you
could want on that cluster, basically hard-coding the ports? That also
loses flexibility from a Docker port-bridging perspective: in my HAProxy
example above, all the Docker containers would have had to expose host port
10000, which would have caused a conflict on node2.
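For what it's worth, here is roughly what consuming an SRV record looks
like for a client that has been written (or wrapped) to do the lookup
itself, using the dnspython library. The record name is my guess at what
mesos-dns would publish for a Marathon app named "hivethrift", so treat it
as an assumption:

# Sketch: resolve host and port for the Hive Thrift app from a mesos-dns
# SRV record, then build the connection URL a client would normally be
# handed. "_hivethrift._tcp.marathon.mesos" is an assumed record name.
import dns.resolver  # pip install dnspython

answers = dns.resolver.query("_hivethrift._tcp.marathon.mesos", "SRV")
record = answers[0]  # a real client would honor SRV priority/weight
host = str(record.target).rstrip(".")
port = record.port

print(f"jdbc:hive2://{host}:{port}/default")

Which just restates the problem: unless the application, or a shim in front
of it, does this lookup, the port in the SRV record goes unused.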



Summary

So while I have a nice long email here, it seems I am either missing
something critical in how service discovery could work with a Mesos
cluster, or there are still some pretty big difficulties we need to
overcome for the enterprise. HAProxy seems cool and works well, except for
those "long-running TCP connections" like Thrift; I am at a loss for how to
handle that. Mesos-DNS is neat too, except for the port conflicts that
would occur if you used fixed native ports on nodes; and if you didn't use
native ports (Mesos-assigned random ports), how do your applications know
which port to connect to? (Yes, it's in the SRV record, but how do you make
apps aware that they should look up a DNS record for a port?)

Am I missing something? How are others handling these issues?
