Hi Avinash,

Sorry for the slow response.

On Wed, Feb 21, 2018 at 11:50 AM, Avinash Sridharan <avin...@mesosphere.io>
wrote:

> On Wed, Feb 21, 2018 at 11:18 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
> > Hi Avinash,
> >
> > We use haproxy for all outgoing traffic. For example, if an instance
> > of service A wants to talk to service B, what it actually does is
> > call "localhost:<some-port>" backed by the local haproxy instance,
> > which then forwards the request to some instance of service B.
> >
> > In such a situation, if the local haproxy is not functional, it's
> > almost certain that anything making outgoing requests will not run
> > properly, and we prefer to drain the host.
> >
>
> I am assuming the local HAProxy is not run within the purview of Mesos




You are right that this haproxy instance is not under the purview of
Mesos yet. This is actually one of the points I'm seeking discussion on
with the community: what position should Mesos take when managing these
agents in people's clusters w.r.t. additional pieces of agent software?
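
To make this concrete, the local routing described above looks roughly
like the sketch below (illustrative only; the service names, ports and
addresses are made up):

    # Local HAProxy on every host: service A reaches "service B" via
    # localhost:9002, and HAProxy forwards to actual instances of B.
    frontend service_b_local
        bind 127.0.0.1:9002
        default_backend service_b

    backend service_b
        balance roundrobin
        server b1 10.0.1.12:31500 check
        server b2 10.0.2.47:31501 check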



> (it
> could potentially be run as a stand-alone container starting with Mesos 1.5)?


This is attractive, but unless Mesos or some framework on top can
provide a good manageability story, only replacing the containerization
part does not solve the hard problem of end-to-end integration with
other scheduler-based solutions.


> So
> how would Mesos even know that there is an issue with HAProxy and boil it
> up? The problem here seems to be that the containers' connectivity is
> controlled by entities outside the Mesos domain. Reporting on problems with
> these entities seems like a hard problem.
>

I don't know whether this is really "hard", or just subtle to model. I
think it depends on whether Mesos wants to be an API system for
container orchestration/scheduling only, or an end-to-end cluster
management stack. If the latter, I think a lot of people really need
some kind of support here.


>
> One option I can think of is to inject command health checks into the
> containers, querying the containers' endpoints through the frontends
> exposed by the local HAProxy. This would allow the detection of any
> failure in HAProxy, which would be boiled up as a Mesos health check
> failure?
>

Guarding haproxy this way is attractive, but there are other issue
conditions whose checks I do not think we can inject into each
application container. Plus, what happens if one day our org migrates
away from haproxy to something like Envoy/Istio? Injecting a new health
check into every container again seems really unnecessary from that
perspective.
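
For reference, the kind of injected check being discussed would be a
per-task command health check probing the local haproxy frontend,
roughly the shape of a Mesos v1 HealthCheck with type COMMAND (the port,
path and thresholds below are made up):

    {
      "type": "COMMAND",
      "command": {
        "shell": true,
        "value": "curl --fail --max-time 5 http://127.0.0.1:9002/health"
      },
      "interval_seconds": 30,
      "consecutive_failures": 3
    }

Repeating something like this in every task definition is exactly the
duplication I'd like to avoid.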


>
> >
> > On Wed, Feb 21, 2018 at 9:45 AM, Avinash Sridharan
> > <avin...@mesosphere.io> wrote:
> >
> > > On Tue, Feb 20, 2018 at 3:54 PM, James Peach <jor...@gmail.com> wrote:
> > >
> > > >
> > > > > On Feb 20, 2018, at 11:11 AM, Zhitao Li <zhitaoli...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > At one of the recent Mesos meetups, quite a few cluster
> > > > > operators expressed complaints that it is hard to model host
> > > > > issues with Mesos at the moment.
> > > > >
> > > > > For example, in our environment, the only signal the scheduler
> > > > > would get is whether the Mesos agent has disconnected from the
> > > > > cluster. However, we have a family of other issues in real
> > > > > production which make the hosts (sometimes "partially")
> > > > > unusable. Examples include:
> > > > > - traffic routing software malfunction (i.e., haproxy): the
> > > > > Mesos agent does not require this, so the scheduler/deployment
> > > > > system is not aware, but actual workloads on the cluster will
> > > > > fail;
> > > >
> > > Zhitao, could you elaborate on this a bit more? Do you mean the
> > > workloads are being load-balanced by HAProxy and due to
> > > misconfiguration the workloads are now unreachable and somehow the
> > > agent should be boiling up these network issues? I am guessing in
> > > your case HAProxy is somehow involved in providing connectivity to
> > > workloads on a given agent and HAProxy is actually running on that
> > > agent?
> > >
> > >
> > > > > - broken disk;
> > > > > - other long-running system agent issues.
> > > > >
> > > > > This email is looking at how Mesos can recommend best practices
> > > > > for surfacing these issues to the scheduler, and whether we need
> > > > > additional primitives in Mesos to achieve such a goal.
> > > >
> > > > In the K8s world the node can publish "conditions" that describe
> > > > its status:
> > > >
> > > >     https://kubernetes.io/docs/concepts/architecture/nodes/#condition
> > > >
> > > > The condition can automatically taint the node, which could cause
> > > > pods to automatically be evicted (i.e. if they can't tolerate that
> > > > specific taint).
> > > >
> > > > J
> > >
> > >
> > >
> > >
> > > --
> > > Avinash Sridharan, Mesosphere
> > > +1 (323) 702 5245
> > >
> >
> >
> >
> > --
> > Cheers,
> >
> > Zhitao Li
> >
>
>
>
> --
> Avinash Sridharan, Mesosphere
> +1 (323) 702 5245
>
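
(As an aside, for anyone not familiar with the K8s mechanism James
mentioned above: a node condition such as disk pressure can be turned
into a taint, and pods without a matching toleration are then kept off
or evicted from that node. A rough, illustrative shape:

    # Condition-derived taint reported on the node:
    #   Taints: node.kubernetes.io/disk-pressure:NoSchedule
    #
    # A pod that should still run there declares a toleration:
    #   tolerations:
    #   - key: "node.kubernetes.io/disk-pressure"
    #     operator: "Exists"
    #     effect: "NoSchedule"

The exact keys and effects depend on the Kubernetes version, but this is
the general shape of the model being referenced.)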



-- 
Cheers,

Zhitao Li
