Maytas,

Appreciate the notes. Unfortunately, we've considered both of those options
and they won't work for us.

The first approach is a bit too casual for our liking. We need a
guarantee that coordination won't run regardless of how long our HDFS
outage is. Plus, rolling this config out and back would require multiple
rolling restarts of the coordinators before and after maintenance, which
we'd rather avoid.
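To make that concrete: the run-period is a static property, so "pausing" that way would mean applying and later reverting something like the sketch below on every coordinator, each change needing a rolling restart (the property name is from the Druid configuration docs; the long duration is purely illustrative):

```
# runtime.properties on each coordinator (static config; a restart is
# required for changes to take effect)
# druid.coordinator.period = time between coordinator runs (ISO-8601 duration)
druid.coordinator.period=P30D   # illustrative "effectively never" value
```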

For the second, we know Historical nodes can run without a coordinator
being up. However, we need to restart all of our Historical nodes during
maintenance, and not having the coordinator up seems to add about two
minutes to each Historical startup, because the node makes multiple
attempts to contact the coordinator for lookup configs before giving up
and starting anyway. With nearly 100 nodes to restart, those minutes add
up, and we want to keep the restart process as quick as possible.
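For reference, the dynamic config proposed in Pull 9224 would instead be toggled at runtime through the coordinator's dynamic configuration endpoint, roughly as sketched below (the "pauseCoordination" field name follows the PR and could still change; host and port are placeholders):

```
# Pause coordination before HDFS maintenance begins
curl -X POST http://coordinator.example.com:8081/druid/coordinator/v1/config \
  -H 'Content-Type: application/json' \
  -d '{"pauseCoordination": true}'

# ... HDFS maintenance and Historical rolling restarts ...

# Resume coordination once HDFS is healthy again (false is the default)
curl -X POST http://coordinator.example.com:8081/druid/coordinator/v1/config \
  -H 'Content-Type: application/json' \
  -d '{"pauseCoordination": false}'
```

One caveat worth noting: the coordinator dynamic-config POST replaces the whole config object, so in practice any existing dynamic config values would need to be included alongside the new flag.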

On Mon, Jan 20, 2020 at 7:05 PM Maytas Monsereenusorn <mayt...@gmail.com>
wrote:

> I'm still pretty new to Druid and might be wrong, but I noticed the following
> points in the documentation for the Coordinator (
> https://druid.apache.org/docs/latest/design/coordinator.html):
>
> *"The Druid Coordinator runs periodically and the time between each run is
> a configurable parameter. Each time the Druid Coordinator runs, it assesses
> the current state of the cluster before deciding on the appropriate actions
> to take."*
> Is it possible to use this configuration and set it to a really large number
> to do what you wanted?
>
> "
> *If the Druid Coordinator is not started up, no new segments will be loaded
> in the cluster and outdated segments will not be dropped. However, the
> Coordinator process can be started up at any time, and after a configurable
> delay, will start running Coordinator tasks. This also means that if you
> have a working cluster and all of your Coordinators die, the cluster will
> continue to function, it just won’t experience any changes to its data
> topology."*From this, it seems like the Coordinator does not to be running
> both when other processes is starting up and if they are already up.
>
> Best Regards,
> Maytas
>
> On Mon, Jan 20, 2020 at 2:10 PM Will Lauer <wla...@verizonmedia.com.invalid>
> wrote:
>
> > I have no idea about the implementation, but the concept is certainly one
> > we have been wanting for quite a while in the several clusters I manage.
> > I'm excited to see this capability added to the system.
> >
> > Will
> >
> > On Mon, Jan 20, 2020, 1:55 PM Lucas Capistrant <capistrant.lu...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > Looking for some feedback on the idea of creating a new dynamic config
> > > for the coordinator that allows cluster admins to pause coordination by
> > > setting the new config to true (the default is false). By "pause
> > > coordination," I mean skipping every coordinator helper each time the
> > > coordinator runs. Some more details are included below, as well as a
> > > link to a PR with the initial implementation that I came up with. Any
> > > feedback helps; we want to make sure we are not overlooking any negative
> > > side effects!
> > >
> > > My organization is preparing to undergo some heavy maintenance on the
> > > HDFS cluster that backs our production Druid clusters. This involves
> > > HDFS downtime. Our plan was to stop the coordinators and overlords and
> > > rolling-restart the Historical nodes during the outage to lay down the
> > > new site files and retain a static picture of the world for client
> > > queries to run against. During our tests in stage, we realized the
> > > Historicals check in with the coordinator when starting up. Therefore,
> > > we wanted to find a way to leave the coordinator up, but not actually
> > > coordinate segments on the cluster, try to run kill tasks, etc.
> > > (because HDFS is offline and we don't want to be talking with it until
> > > we know it is back up and healthy). Thus, Pull
> > > 9224 <https://github.com/apache/druid/pull/9224/files> was born. This
> > > seemed like an easy and effective way to halt coordination and keep the
> > > API up.
> > >
> > > We've done some small-scale testing in a dev environment, and I am
> > > currently looking into writing some type of integration test that
> > > flexes this code path. Despite the change's perceived simplicity, it
> > > would be nice to have something there.
> > >
> > > Thanks!
> > > Lucas Capistrant
> > >
> >
>
