On Wed, Feb 21, 2018 at 11:18 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:

> Hi Avinash,
>
> We use haproxy for all outgoing traffic. For example, if an instance of
> service A wants to talk to service B, it actually calls
> "localhost:<some-port>", backed by the local haproxy instance, which then
> forwards the request to some instance of service B.
>
> In such a situation, if the local haproxy is not functional, it's almost
> certain that anything making outgoing requests will not run properly, and
> we prefer to drain the host.
>
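For concreteness, here is a minimal sketch of the local-forwarding setup
described above; the listen port and backend addresses are made up for
illustration:

```
# Local haproxy on each host: service A calls localhost:9090,
# which forwards to instances of service B elsewhere in the cluster.
# Port 9090 and the server addresses below are hypothetical.
frontend service_b_local
    bind 127.0.0.1:9090
    default_backend service_b

backend service_b
    balance roundrobin
    server b1 10.0.0.11:8080 check
    server b2 10.0.0.12:8080 check
```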

I am assuming the local HAProxy is not run within the purview of Mesos (it
could potentially be run as a stand-alone container starting with Mesos
1.5)? So how would Mesos even know that there is an issue with HAProxy and
surface it? The problem here seems to be that the containers' connectivity
is controlled by entities outside the Mesos domain. Reporting on problems
with these entities seems like a hard problem.

One option I can think of is to inject command health checks for the
containers that query the container's endpoints through the frontends
exposed by the local HAProxy. This would allow any failure in HAProxy to be
detected and surfaced as a Mesos health check failure.
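A sketch of what such a check might look like in a Marathon-style app
definition; the port 9090 and the /health path are assumptions about the
service, not something Mesos mandates:

```json
{
  "healthChecks": [
    {
      "protocol": "COMMAND",
      "command": {
        "value": "curl -fsS http://localhost:9090/health"
      },
      "gracePeriodSeconds": 30,
      "intervalSeconds": 10,
      "timeoutSeconds": 5,
      "maxConsecutiveFailures": 3
    }
  ]
}
```

If the local HAProxy frontend on that port stops forwarding, the curl fails
and the task is reported unhealthy, which the scheduler can then act on.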

>
> On Wed, Feb 21, 2018 at 9:45 AM, Avinash Sridharan <avin...@mesosphere.io>
> wrote:
>
> > On Tue, Feb 20, 2018 at 3:54 PM, James Peach <jor...@gmail.com> wrote:
> >
> > >
> > > > On Feb 20, 2018, at 11:11 AM, Zhitao Li <zhitaoli...@gmail.com>
> wrote:
> > > >
> > > > Hi,
> > > >
> > > > At one of the recent Mesos meetups, quite a few cluster operators
> > > > expressed complaints that it is hard to model host issues with
> > > > Mesos at the moment.
> > > >
> > > > For example, in our environment, the only signal the scheduler
> > > > would know is whether the Mesos agent has disconnected from the
> > > > cluster. However, we have a family of other issues in real
> > > > production which make the hosts (sometimes "partially") unusable.
> > > > Examples include:
> > > > - traffic routing software malfunction (e.g., haproxy): the Mesos
> > > > agent does not require this, so the scheduler/deployment system is
> > > > not aware, but actual workload on the cluster will fail;
> > >
> > Zhitao, could you elaborate on this a bit more? Do you mean the
> > workloads are being load-balanced by HAProxy and, due to
> > misconfiguration, the workloads are now unreachable, and somehow the
> > agent should be surfacing these network issues? I am guessing in your
> > case HAProxy is somehow involved in providing connectivity to workloads
> > on a given agent, and HAProxy is actually running on that agent?
> >
> >
> > > > - broken disk;
> > > > - other long running system agent issues.
> > > >
> > > > This email looks at how Mesos can recommend best practices for
> > > > surfacing these issues to schedulers, and whether we need
> > > > additional primitives in Mesos to achieve such a goal.
> > >
> > > In the K8s world the node can publish "conditions" that describe its
> > > status:
> > >
> > >         https://kubernetes.io/docs/concepts/architecture/nodes/#condition
> > >
> > > The condition can automatically taint the node, which could cause
> > > pods to automatically be evicted (i.e., if they can't tolerate that
> > > specific taint).
> > >
> > > J
> >
> >
> >
> >
> > --
> > Avinash Sridharan, Mesosphere
> > +1 (323) 702 5245
> >
>
>
>
> --
> Cheers,
>
> Zhitao Li
>



-- 
Avinash Sridharan, Mesosphere
+1 (323) 702 5245
