(I sent this message to druid-user last week and got no response. Since it
is proposing making improvements to Druid, I thought maybe it would be
appropriate to resend here. Hope that's OK.)

We had a big outage in our Druid cluster last week.  We run our Druid
servers in Kubernetes, and our historicals use machine local SSDs for their
segment caches.  We made the unfortunate choice to have our production and
staging historicals share the same pool of machines, and during this outage
we got bitten by that for the first time.

A production historical started up on a machine whose segment cache
contained segments from our staging cluster.  Our prod and staging clusters
use the same names for data sources.

This meant that these segments overshadowed production segments which
happened to have lower versions.  Worse, when
DruidCoordinatorCleanupOvershadowed kicked in, all of the production
segments that were overshadowed got used=false set, and quickly got dropped
from historicals. This ended up being the majority of our data.  We
eventually figured out what was going on and did a bunch of manual steps to
clean up (turning off and clearing the cache of the two historicals that
had staging segments on them, manually setting used=true for all entries in
druid_segments, waiting a long long time for data to re-download), but
figuring out what was going on was subtle (I was very lucky I had randomly
decided to read a lot of the code about how the `used` column works and how
versioned timelines are calculated just a few days before!).
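
To make the overshadowing concrete, here is a toy Python model of what I
believe happened. It is a sketch, not Druid's actual timeline code: it
assumes segments cover identical intervals, that versions compare as plain
strings, and that the highest version wins. The `Segment` class, the
`origin` label, and the sample values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    data_source: str
    interval: str
    version: str
    origin: str  # hypothetical label; nothing like this is stored today


def visible_segments(announced):
    """Keep only the highest-version segment per (data source, interval)."""
    winners = {}
    for seg in announced:
        key = (seg.data_source, seg.interval)
        current = winners.get(key)
        # Assumption: versions are compared lexicographically as strings.
        if current is None or seg.version > current.version:
            winners[key] = seg
    return list(winners.values())


def overshadowed(announced):
    """Segments that lose to a higher version of the same interval."""
    visible = set(visible_segments(announced))
    return [seg for seg in announced if seg not in visible]


# Same data source name and interval, but the staging segment was
# written later and so carries a higher version.
prod = Segment("events", "2019-01-01/2019-01-02", "2019-01-02T00:00:00.000Z", "prod")
staging = Segment("events", "2019-01-01/2019-01-02", "2019-01-05T00:00:00.000Z", "staging")

losers = overshadowed([prod, staging])
```

In this model `losers` contains only the production segment, which is the
one that then got used=false set.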

(We were also lucky that we had turned off coordinator automatic killing
literally that morning!)

I feel like Druid should have been able to protect me from this to some
degree. (Yes, we are going to address the root cause by making it
impossible for prod and staging to reuse each other's disks.) Some thoughts
on changes that could have helped:

- Is the Druid standard to prepend the "cluster" name to the data source
name, so that conflicts like this are never possible?  We are certainly
tempted to do this now but nobody ever told us to. If that's the standard,
should it be documented?

- Should clusters have an optional name/namespace, and DataSegments have
that namespace recorded in them, and clusters refuse to handle segments they
find that are from a different namespace? This would be like the common
database setup where a single server/cluster has a set of databases which
each have a set of tables.

- Should historicals refuse to announce segments that don't exist in the
druid_segments table, or should coordinators/brokers/etc refuse to pay
attention to segments announced *by historicals* that don't exist in the
druid_segments table?  I'm going to guess this is difficult to do in the
historical because the historical probably doesn't actually talk to the SQL
DB at all? But maybe it could be done by coordinator and broker?
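
The last two ideas amount to a filter applied before an announced segment
is allowed into the timeline. A rough Python sketch of that check follows.
Everything in it is hypothetical: the `namespace` field does not exist on
segments today, and `known_segment_ids` stands in for whatever the
coordinator or broker already knows from the druid_segments table.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class AnnouncedSegment:
    segment_id: str
    namespace: Optional[str] = None  # hypothetical field, not in Druid today


def accept(segment: AnnouncedSegment,
           cluster_namespace: Optional[str],
           known_segment_ids: Set[str]) -> bool:
    """Return True only if the segment plausibly belongs to this cluster."""
    # Idea 2: if the cluster has a namespace, reject foreign segments.
    if cluster_namespace is not None and segment.namespace != cluster_namespace:
        return False
    # Idea 3: ignore segments that the metadata store has never heard of.
    return segment.segment_id in known_segment_ids


known = {"events_2019-01-01_prod-v1"}
ours = AnnouncedSegment("events_2019-01-01_prod-v1", namespace="prod")
stray = AnnouncedSegment("events_2019-01-01_staging-v2", namespace="staging")

accepted = [s for s in (ours, stray) if accept(s, "prod", known)]
```

Either check alone would have rejected the stray staging segment here; the
metadata check also works for clusters that never set a namespace.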

--dave
