Rather lengthy post follows. In case people didn't see it, much of this
references this document:
https://docs.google.com/document/d/1r0wXnihx4yESOpLJ87Wh2g1V-aHOFUkbgiFe8RHJpZo/edit

Over the past couple of weeks, I upgraded all of our standalone Zeek clusters
(about a dozen) to Zeek 3.2, and I also wanted to use this as an opportunity
to try moving over to the new Supervisor framework.

I think there are cases where the Supervisor framework is a great fit, and
for being at an early stage, it really does work well. However, I ended up
not using the Supervisor framework at all, and just running everything
directly from systemd (with no zeekctl). I realize that "just use systemd"
isn't a solution for everyone, but I'd like to share some thoughts in order
to suss out whether there's something inherently different about our
environment/setup, or whether the Supervisor and Cluster Controller
frameworks need some tweaking or rethinking.

Configuring the Supervisor framework was more difficult than configuring
systemd directly. In both cases, I need to configure my cluster layout, which
I do via an Ansible template. Then I need to tell the Zeek processes their
working directory, CPU pinning assignment, environment variables, interface
to sniff on, etc. These are all things that systemd can do natively, and I
found it was much easier to do via systemd templates.
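For reference, here's a trimmed-down sketch of what one of those worker
template units looks like in spirit (the unit, slice, and interface names
match the status output further down; the working directory, environment,
and CPU pinning values are illustrative placeholders rather than our exact
configuration):

    # /etc/systemd/system/zeek_worker@.service -- rendered from an Ansible template
    [Unit]
    Description=Zeek worker %i
    After=network.target

    [Service]
    Slice=zeek-worker.slice
    # Placeholder working directory; each instance gets its own.
    WorkingDirectory=/usr/local/zeek/spool/worker-%i
    # Zeek's cluster framework reads CLUSTER_NODE to find this node in the
    # cluster layout; the naming scheme here is illustrative.
    Environment=CLUSTER_NODE=worker-%i
    # Simplistic pinning: instance N runs on CPU N. Per-instance assignments
    # can also be generated as drop-ins from the same Ansible template.
    CPUAffinity=%i
    ExecStart=/usr/local/zeek/bin/zeek site/local -i ens1
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target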

Monitoring the cluster was also more difficult with Supervisor. When I
check the status of the cluster, I just see a bunch of identically-named
processes. I can't tell which process is using excessive memory, and if I
want to check the logs for a process, I have to go manually look at its
stderr and stdout in its working directory.

With systemd, I see something like this:

   Active: active since Fri 2020-08-28 08:13:19 PDT; 4 days ago
    Tasks: 247
   Memory: 3.1G
   CGroup: /zeek.slice
           ├─zeek-worker.slice
           │ ├─zeek_worker@1.service
           │ │ └─55473 /usr/local/zeek/bin/zeek site/local -i ens1
           │ ├─zeek_worker@2.service
           │ │ └─55253 /usr/local/zeek/bin/zeek site/local -i ens1
           ..........
           ├─zeek_cluster@control.service
           │ └─55074 /usr/local/zeek/bin/zeek site/local
           ├─zeek_cluster@logger.service
           │ └─55125 /usr/local/zeek/bin/zeek site/local
           ├─zeek_cluster@manager.service
           │ └─55114 /usr/local/zeek/bin/zeek site/local
           └─zeek_cluster@proxy.service
             └─55116 /usr/local/zeek/bin/zeek site/local

I immediately know which process is which, per-node logs go to my standard
logging infrastructure, and I can restart a single node if it's misbehaving.
I even took this a step further and configured systemd to kill and restart
any worker if it ever swaps, while still allowing the logger, manager, etc.
to swap.
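That behavior comes from systemd's cgroup resource controls. Roughly,
assuming cgroup v2, the worker units carry settings along these lines (the
memory limit shown is an illustrative value, not our exact number), while
the zeek_cluster@ units simply omit them so the logger and manager are
allowed to swap:

    # Drop-in for zeek_worker@.service (sketch; assumes cgroup v2)
    [Service]
    # A worker that hits its memory limit is OOM-killed rather than
    # swapped out, since its swap usage is capped at zero...
    MemoryMax=4G
    MemorySwapMax=0
    # ...and systemd then brings up a fresh copy of the worker.
    Restart=on-failure
    RestartSec=5s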

Ultimately, given the choice between systemd + Supervisor versus just
systemd, for our use case just systemd gave us some distinct benefits and
reduced complexity. My intent is not to be critical of the existing work; as
I said, I think there are cases where it's a great fit. However, I also
wanted to share my experiences and thoughts after running it for a while.

That experience gave me a new perspective on the Cluster Controller
framework. It is essentially another layer of orchestration over the
cluster, except that there are specific things it does not do; namely, it is
not responsible for keeping Zeek binaries, packages, and scripts in sync.
So, in order to use it, I need some layer that will distribute those files
and ensure they match across a cluster. Once again, it feels like I'm left
with a choice between one orchestration tool + the Cluster Controller
framework, versus just using a single orchestration tool to keep these files
in sync and handle cluster stop/start/restarts.

Take, for instance, a cluster restart. This could either be planned (e.g. I
upgraded a binary or script and want that change to take effect) or
unplanned (something crashed and the cluster is now degraded). In the first
case, I'd need to use my orchestration tool to push out the file changes
anyway, and then it becomes trivial to have it handle the restart as well.
In fact, having my orchestration tool do that also lets me do things like
fall back to the previous version should the upgrade go wrong.
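Concretely, once the orchestration tool has pushed the changed files, the
restart itself amounts to a one-liner against the systemd units (systemctl
accepts glob patterns, which match the currently loaded, i.e. running,
cluster units):

    systemctl restart 'zeek_cluster@*.service' 'zeek_worker@*.service'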
In the second case, rather than building yet another layer on top, I'd like
to see more resiliency in the Cluster framework itself.

There are other capabilities, such as coordinated starting and stopping of
the cluster, but the only reason I perform those actions is that I have to
(e.g. the system running the logger is going to reboot, and I know that if I
don't stop the whole cluster, logs will queue in memory and workers will
crash).

So, I'm left wondering if the Cluster Controller framework is really
solving the right problems. Rather than worry about coordinated restarts,
I'd like to see the need for such restarts go away with better resiliency.
If I need to reboot a system in a cluster, and it's running the manager and
logger, I'd like to see another system in the cluster get promoted to being
the manager and logger, and all the nodes to start talking to that instead.
I could see a design where all systems run managers and loggers, with only
one active at a time. If the active manager shares state with the standby
managers, a failover can happen seamlessly.

It's true that the world has moved on from the model of broctl and zeekctl,
but I feel that's because it has moved towards better resiliency in
distributed systems, and it feels a bit like the Cluster Controller
framework is trying to take the old zeekctl features and fit them into a new
model.

  --Vlad